1  Introduction

Many  problems  in  learning  theory  can  be  effectively 
modelled  as  learning  an  input  output  mapping  on  the 
basis  of  limited  evidence  of  what  this  mapping  might  be. 
The  mapping  usually  takes  the  form  of  some  unknown 
function  between  two  spaces  and  the  evidence  is  often  a 
set  of  labelled,  noisy  examples,  i.e.,  (x,  y)  pairs  which  are
consistent  with  this  function.  On  the  basis  of  this  data 
set,  the  learner  tries  to  infer  the  true  function. 

Such a scenario of course exists in a wide range of
scientific disciplines. For example, in speech recognition,
there might exist some functional relationship between
sounds and their phonetic identities. We are given
(sound, phonetic identity) pairs from which we try to infer
the underlying function. This example from speech
recognition belongs to a large class of pattern classification
problems where the patterns could be visual, acoustic,
or tactile. In economics, it is sometimes of interest
to predict future foreign currency rates on the basis
of the past time series. There might be a function
which captures the dynamical relation between past and
future currency rates, and one typically tries to uncover
this relation from data which has been appropriately
processed. Similarly in medicine, one might be interested in
predicting whether or not breast cancer will recur in a
patient within five years after her treatment. The input
space might involve dimensions like the age of the patient,
whether she has been through menopause, the radiation
treatment previously used, etc. The output space
would be single dimensional and boolean, taking on values
depending upon whether breast cancer recurs or not. One
might collect data from case histories of patients and try
to uncover the underlying function.

The unknown target function is assumed to belong to
some class T which, using the terminology of computational
learning theory, we call the concept class. Typical
examples of concept classes are classes of indicator
functions, boolean functions, Sobolev spaces, etc. The
learner is provided with a finite data set. One can make
many assumptions about how this data set is collected,
but a common assumption which suffices for our
purposes is that the data is drawn by sampling independently
the input output space (X × Y) according
to some unknown probability distribution. On the basis
of this data, the learner then develops a hypothesis
(another function) about the identity of the target function,
i.e., it comes up with a function chosen from some
class, say H (the hypothesis class), which best fits the
data and postulates this to be the target. Hypothesis
classes could also be of different kinds. For example,
they could be classes of boolean functions, polynomials,
linear functions, spline functions and so on. One such
class which is being increasingly used for learning problems
is the class of feedforward networks [53], [43], [35]. A
typical feedforward network is a parametrized function
of the form

    f(x) = \sum_{i=1}^{n} c_i H(x; w_i)

where the c_i and w_i are free parameters and H(·;·) is a
given, fixed function (the “activation function”). Depending
on the choice of the activation function one gets different
network models, such as the most common form of “neural
networks”, the Multilayer Perceptron
[74, 18, 51, 43, 44, 30, 57, 56, 46], or the Radial
Basis Functions network [14, 26, 39, 40, 58, 70, 59, 67,
66, 32, 35].
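
As a concrete illustration of the parametrized form above, the following minimal sketch evaluates a network f(x) = sum_i c_i H(x; w_i) for one possible activation function. The Gaussian form of H, the `width` parameter, and all function names are assumptions made for this example only; the text leaves H generic at this point.

```python
import math

def gaussian_basis(x, center, width=1.0):
    """Assumed activation H(x; w): a Gaussian bump centered at w = center."""
    sq_dist = sum((xj - cj) ** 2 for xj, cj in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def network(x, coeffs, centers, width=1.0):
    """Evaluate f(x) = sum_i c_i * H(x; w_i) with n = len(coeffs) units."""
    return sum(c * gaussian_basis(x, w, width) for c, w in zip(coeffs, centers))

# Two-unit network on a 2-dimensional input (all values are illustrative).
coeffs = [2.0, -1.0]
centers = [(0.0, 0.0), (1.0, 1.0)]
value_at_origin = network((0.0, 0.0), coeffs, centers)
```

Here the free parameters are the coefficients c_i and the centers w_i; swapping `gaussian_basis` for a sigmoidal unit would give a different network model with the same overall structure.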

If, as more and more data becomes available, the
learner’s hypothesis becomes closer and closer to the target
and converges to it in the limit, the target is said to
be learnable. The error between the learner’s hypothesis
and the target function is defined to be the generalization
error, and for the target to be learnable the generalization
error should go to zero as the amount of data goes to infinity.
While learnability is certainly a very desirable quality, it
requires the fulfillment of two important criteria.

First, there is the issue of the representational capacity
(or hypothesis complexity) of the hypothesis class.
This must have sufficient power to represent or closely
approximate the concept class. Otherwise, for some target
function f, the best hypothesis h in H might be far
away from it. The error that this best hypothesis makes
is formalized later as the approximation error. In this
case, all the learner can hope to do is to converge to h
in the limit of infinite data, and so it will never recover
the target. Second, we do not have infinite data but
only some finite random sample set from which we construct
a hypothesis. This hypothesis constructed from
the finite data might be far from the best possible hypothesis,
h, resulting in a further error. This additional
error (caused by finiteness of data) is formalized later as
the estimation error. The amount of data needed to ensure
a small estimation error is referred to as the sample
complexity of the problem. The hypothesis complexity,
the sample complexity and the generalization error are
related. If the class H is very large or, in other words,
has high complexity, then for the same estimation error
the sample complexity increases. If the hypothesis complexity
is small, the sample complexity is also small, but
now for the same estimation error the approximation error
is high. This point has been developed in terms of
the Bias-Variance trade-off by Geman et al. [31] in the
context of neural networks, and by others [72, 38, 80, 75] in
statistics in general.

The purpose of this paper is two-fold. First, we formalize
the problem of learning from examples so as to
highlight the relationship between hypothesis complexity,
sample complexity and total error. Second, we explore
this relationship in the specific context of a particular
hypothesis class. This is the class of Radial Basis
Function networks, which can be considered to belong to
the broader class of feed-forward networks. Specifically,
we are interested in asking the following questions about
radial basis functions.

Imagine you were interested in solving a particular
problem (regression or pattern classification) using Radial
Basis Function networks. Then, how large must the
network be and how many examples do you need to draw
so that you are guaranteed with high confidence to do
very well? Conversely, if you had a finite network and
a finite amount of data, what are the kinds of problems
you could solve effectively?

Clearly, if one were using a network with a finite
number of parameters, then its representational capacity
would be limited and therefore even in the best case
we would make an approximation error. Drawing upon
results in approximation theory [55], several researchers
[18, 41, 6, 44, 15, 3, 57, 56, 46, 76] have investigated
the approximating power of feedforward networks, showing
how, as the number of parameters goes to infinity,
the network can approximate any continuous function.
These results assume infinite data, and questions of
learnability from finite data are ignored. For a finite network,
due to finiteness of the data, we make an error
in estimating the parameters and consequently have an
estimation error in addition to the approximation error
mentioned earlier. Using results from Vapnik and
Chervonenkis [80, 81, 82, 83] and Pollard [69], work has
also been done [42, 9] on the sample complexity of finite
networks, showing how, as the data goes to infinity, the
estimation error goes to zero, i.e., the empirically optimized
parameter settings converge to the optimal ones
for that class. However, since the number of parameters
is fixed and finite, even the optimal parameter setting
might yield a function which is far from the target. This
issue is left unexplored by Haussler [42] in an excellent
investigation of the sample complexity question.

In this paper, we explore the errors due to both finite
parameters and finite data in a common setting. In order
for the total generalization error to go to zero, both the
number of parameters and the number of data have to
go to infinity, and we provide rates at which they must grow
for learnability to result. Further, as a corollary, we are
able to provide a principled way of choosing the optimal
number of parameters so as to minimize expected errors.
It should be mentioned here that White [85] and Barron
[7] have provided excellent treatments of this problem
for different hypothesis classes. We will mention their
work at appropriate points in this paper.

The  plan  of  the  paper  is  as  follows:  in  section  2  we 
will  formalize  the  problem  and  comment  on  issues  of  a 
general  nature.  We  then  provide  in  section  3  a  precise 
statement  of  a  specific  problem.  In  section  4  we  present 
our  main  result,  whose  proof  is  postponed  to  appendix  D 
for  continuity  of  reading.  The  main  result  is  qualified  by 
several  remarks  in  section  5.  In  section  6  we  will  discuss 
what  could  be  the  implications  of  our  result  in  practice 
and  finally  we  conclude  in  section  7  with  a  reiteration  of 
our  essential  points. 


2  Definitions  and  Statement  of  the 
Problem 

In  order  to  make  a  precise  statement  of  the  problem  we 
first  need  to  introduce  some  terminology  and  to  define 
a  number  of  mathematical  objects.  A  summary  of  the 
most  common  notations  and  definitions  used  in  this  paper
can  be  found  in  appendix  A.


2.1  Random  Variables  and  Probability
Distributions

Let X and Y be two arbitrary sets. We will call x
and y the independent variable and response respectively,
where x and y range over the generic elements of X and
Y. In most cases X will be a subset of a k-dimensional
Euclidean space and Y a subset of the real line, so that
the independent variable will be a k-dimensional vector
and the response a real number. We assume that a
probability distribution P(x, y) is defined on X × Y. P
is unknown, although certain assumptions on it will be
made later in this section.

The probability distribution P(x, y) can also be written
as^1:

    P(x, y) = P(x) P(y|x),    (1)

where P(y|x) is the conditional probability of the response
y given the independent variable x, and P(x)
is the marginal probability of the independent variable
given by:

    P(x) = \int_Y dy P(x, y).

Expected values with respect to P(x, y) or P(x) will
always be indicated by E[·]. Therefore, we will write:

    E[g(x, y)] = \int_{X \times Y} dx dy P(x, y) g(x, y)

and

    E[h(x)] = \int_X dx P(x) h(x)

for any arbitrary function g or h.

2.2  Learning  from  Examples  and  Estimators 

The framework described above can be used to model
the fact that in the real world we often have to deal with
sets of variables that are related by a probabilistic
relationship. For example, y could be the measured torque
at a particular joint of a robot arm, and x the set of angular
positions, velocities and accelerations of the joints of
the arm in a particular configuration. The relationship
between x and y is probabilistic because there is noise
affecting the measurement process, so that two different
torques could be measured given the same configuration.

In many cases we are provided with examples of this
probabilistic relationship, that is with a data set D_l,
obtained by sampling l times the set X × Y according to
P(x, y):

    D_l = \{(x_i, y_i) \in X \times Y\}_{i=1}^{l}.

^1 Note that we are assuming that the conditional distribution
exists, but this is not a very restrictive assumption.

From eq. (1) we see that we can think of an element
(x_i, y_i) of the data set D_l as obtained by sampling X
according to P(x), and then sampling Y according to
P(y|x). In the robot arm example described above, it
would mean that one could move the robot arm into
a random configuration x_i, measure the corresponding
torque y_i, and iterate this process l times.

The interesting problem is, given an instance of x that
does not appear in the data set D_l, to give an estimate
of what we expect y to be. For example, given a certain
configuration of the robot arm, we would like to estimate
the corresponding torque.

Formally, we define an estimator to be any function
f : X → Y. Clearly, since the independent variable x
need not determine uniquely the response y, any estimator
will make a certain amount of error. However, it
is interesting to study the problem of finding the best
possible estimator, given the knowledge of the data set
D_l, and this problem will be defined as the problem of
learning from examples, where the examples are represented
by the data set D_l. Thus we have a probabilistic
relation between x and y. One can think of this as an
underlying deterministic relation corrupted with noise.
Hopefully a good estimator will be able to recover this
relation.

2.3  The  Expected  Risk  and  the  Regression 
Function 

In the previous section we explained the problem of
learning from examples and stated that this is the same
as the problem of finding the best estimator. To make
sense of this statement, we now need to define a measure
of how good an estimator is. Suppose we sample
X × Y according to P(x, y), obtaining the pair (x, y). A
measure^2 of the error of the estimator f at the point x
is:

    (y - f(x))^2.

In the example of the robot arm, f(x) is our estimate of
the torque corresponding to the configuration x, and y is
the measured torque of that configuration. The average
error of the estimator f is now given by the functional

    I[f] = E[(y - f(x))^2] = \int_{X \times Y} dx dy P(x, y) (y - f(x))^2,

which is usually called the expected risk of f for the specific
choice of the error measure.

Given this particular measure as our yardstick to evaluate
different estimators, we are now interested in finding
the estimator that minimizes the expected risk. In
order to proceed we need to specify its domain of definition
T. Then, using the expected risk as a criterion,
we could obtain the best element of T. Depending on
the properties of the unknown probability distribution
P(x, y) one could make different choices for T. We will
assume in the following that T is some space of differentiable
functions. For example, T could be a space of
functions with a certain number of bounded derivatives
(the spaces Λ^m(R^k) defined in appendix A), or a Sobolev
space of functions with a certain number of derivatives
in L_p (also defined in appendix A).

^2 Note that this is the familiar squared error, which when
averaged over its domain yields the mean squared error for a
particular estimator, a very common choice. However, it is
useful to remember that there could be other choices as well.


Assuming that the problem of minimizing I[f] in T is
well posed, it is easy to obtain its solution. In fact, the
expected risk can be decomposed in the following way
(see appendix B):

    I[f] = E[(f_0(x) - f(x))^2] + E[(y - f_0(x))^2]    (2)

where f_0(x) is the so-called regression function, that is
the conditional mean of the response given the independent
variable:

    f_0(x) = \int_Y dy \, y P(y|x).    (3)

From eq. (2) it is clear that the regression function is
the function that minimizes the expected risk in T, and
is therefore the best possible estimator. Hence,

    f_0(x) = \arg\min_{f \in T} I[f].

However, it is also clear that even the regression function
will make an error equal to E[(y − f_0(x))^2], that
is the variance of the response given a certain value for
the independent variable, averaged over the values the
independent variable can take. While the first term in
eq. (2) depends on the choice of the estimator f, the second
term is an intrinsic limitation that comes from the
fact that the independent variable x does not determine
uniquely the response y.
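
The decomposition of the expected risk can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it replaces the integrals by sums over a hypothetical discrete joint distribution `P`, and the estimator `f` is an arbitrary choice; none of these values come from the paper.

```python
# Hypothetical joint distribution P(x, y) on a finite set (probabilities sum to 1).
P = {(0, 0.0): 0.2, (0, 1.0): 0.2, (1, 1.0): 0.1, (1, 3.0): 0.5}

def expectation(g):
    """E[g(x, y)] as a finite sum over the support of P."""
    return sum(p * g(x, y) for (x, y), p in P.items())

def marginal(x):
    """P(x) obtained by summing P(x, y) over y."""
    return sum(p for (xx, _), p in P.items() if xx == x)

def regression(x):
    """f_0(x): conditional mean of y given x."""
    return sum(p * y for (xx, y), p in P.items() if xx == x) / marginal(x)

def expected_risk(f):
    """I[f] = E[(y - f(x))^2]."""
    return expectation(lambda x, y: (y - f(x)) ** 2)

def f(x):
    """An arbitrary estimator, chosen only for illustration."""
    return 0.5 + 1.0 * x

lhs = expected_risk(f)
rhs = expectation(lambda x, y: (regression(x) - f(x)) ** 2) + expected_risk(regression)
```

The two sides agree up to rounding, and the second term, `expected_risk(regression)`, is the part of the risk that no estimator can remove.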

The problem of learning from examples can now be
reformulated as the problem of reconstructing the regression
function f_0, given the example set D_l. Thus we
have some large class of functions T to which the target
function f_0 belongs. We obtain noisy data of the form
(x, y) where x has the distribution P(x) and, for each x,
y is a random variable with mean f_0(x) and distribution
P(y|x). We note that y can be viewed as a deterministic
function of x corrupted by noise. If one assumes
the noise is additive, we can write y = f_0(x) + η_x, where
η_x is zero-mean with distribution P(y|x).^3 We choose an
estimator on the basis of the data set and we hope that
it is close to the regression (target) function. It should
also be pointed out that this framework includes pattern
classification, and in this case the regression (target)
function corresponds to the Bayes discriminant function
[36, 45, 71].

2.4  The  Empirical  Risk 

If the expected risk functional I[f] were known, one
could compute the regression function by simply finding
its minimum in T, which would make the whole learning
problem considerably easier. What makes the problem
difficult and interesting is that in practice I[f] is unknown
because P(x, y) is unknown. Our only source of
information is the data set D_l, which consists of l independent
random samples of X × Y drawn according to
P(x, y). Using this data set, the expected risk can be
approximated by the empirical risk I_{emp}:

^3 Note that the standard regression problem often assumes
η_x is independent of x. Our case is distribution free because
we make no assumptions about the nature of η_x.


    I_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2

For each given estimator f, the empirical risk is a random
variable, and under fairly general assumptions^4, by the
law of large numbers [23] it converges in probability to
the expected risk as the number of data points goes to
infinity:

    \lim_{l \to \infty} P\{ |I[f] - I_{emp}[f]| > \varepsilon \} = 0    \forall \varepsilon > 0.    (4)

Therefore a common strategy consists in estimating the
regression function as the function that minimizes the
empirical risk, since it is “close” to the expected risk if
the number of data is high enough. For the error metric
we have used, this yields the least-squares error estimator.
However, eq. (4) states only that the expected risk
is “close” to the empirical risk for each given f, and not
for all f simultaneously. Consequently the fact that the
empirical risk converges in probability to the expected
risk when the number, l, of data points goes to infinity
does not guarantee that the minimum of the empirical
risk will converge to the minimum of the expected risk
(the regression function). As pointed out and analysed
in the fundamental work of Vapnik and Chervonenkis
[81, 82, 83], the notion of uniform convergence in probability
has to be introduced, and it will be discussed in
other parts of this paper.
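
The pointwise convergence of the empirical risk to the expected risk, for one fixed estimator, can be illustrated with a small simulation. Everything specific in this sketch is an assumption made for the example: the regression function `f0`, the uniform input distribution, the uniform zero-mean noise, and the fixed estimator `f`. Note that it says nothing about uniform convergence over a whole class.

```python
import random

def f0(x):
    """Assumed regression function."""
    return 2.0 * x

def f(x):
    """A fixed estimator whose risk we measure."""
    return x

def sample(l, rng):
    """Draw l pairs: x ~ uniform on [0, 1), y = f0(x) + zero-mean noise."""
    data = []
    for _ in range(l):
        x = rng.random()
        y = f0(x) + rng.uniform(-0.5, 0.5)
        data.append((x, y))
    return data

def empirical_risk(f, data):
    """I_emp[f] = (1/l) * sum_i (y_i - f(x_i))^2."""
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)

# For these choices, I[f] = E[(f0 - f)^2] + noise variance = 1/3 + 1/12.
expected_risk = 1.0 / 3.0 + 1.0 / 12.0

rng = random.Random(0)
gaps = {l: abs(empirical_risk(f, sample(l, rng)) - expected_risk)
        for l in (10, 1000, 100000)}
```

For a large sample the gap is small with high probability, which is the content of eq. (4) for this single f.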

2.5  The  Problem 

The argument of the previous section suggests that an
approximate solution of the learning problem consists in
finding the minimum of the empirical risk, that is solving

    \min_{f \in T} I_{emp}[f].

However this problem is clearly ill-posed, because, for
most choices of T, it will have an infinite number of
solutions. In fact, all the functions in T that interpolate
the data points (x_i, y_i), that is with the property

    f(x_i) = y_i,    i = 1, ..., l

will give a zero value for I_{emp}. This problem is very
common in approximation theory and statistics and can
be approached in several ways. A common technique
consists in restricting the search for the minimum to a
smaller set than T. We consider the case in which this
smaller set is a family of parametric functions, that is a
family of functions defined by a certain number of real
parameters. The choice of a parametric representation
also provides a convenient way to store and manipulate
the hypothesis function on a computer.
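
The ill-posedness can be seen on a toy data set: two different functions that both interpolate the data have zero empirical risk, yet disagree away from the sample points. The data and the two functions below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Three hypothetical data points (x_i, y_i).
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)]

def empirical_risk(f, data):
    """I_emp[f] = (1/l) * sum_i (y_i - f(x_i))^2."""
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)

def f_a(x):
    """First interpolant: a parabola through the data."""
    return x ** 2

def f_b(x):
    """Second interpolant: f_a plus a term that vanishes at every x_i."""
    return x ** 2 + 5.0 * x * (x - 1.0) * (x - 2.0)

risk_a = empirical_risk(f_a, data)
risk_b = empirical_risk(f_b, data)
gap_off_sample = abs(f_a(1.5) - f_b(1.5))
```

Both risks are zero, so the empirical risk alone cannot choose between the two functions; restricting the search to a smaller parametric family removes this ambiguity.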

^4 For example, assuming the data is independently drawn
and I[f] is finite.

We will denote a generic subset of T whose elements
are parametrized by a number of parameters proportional
to n, by H_n. Moreover, we will assume that the
sets H_n form a nested family, that is

    H_1 \subset H_2 \subset \ldots \subset H_n \subset \ldots \subset H.

For example, H_n could be the set of polynomials in one
variable of degree n − 1, Radial Basis Functions with n
centers, multilayer perceptrons with n sigmoidal hidden
units, multilayer perceptrons with n threshold units and
so on. Therefore, we choose as approximation to the
regression function the function f_{n,l} defined as:^5

    f_{n,l} = \arg\min_{f \in H_n} I_{emp}[f]    (5)

Thus, for example, if H_n is the class of functions which
can be represented as f = \sum_{\alpha=1}^{n} c_\alpha H(x; w_\alpha), then eq.
(5) can be written as

    f_{n,l} = \arg\min_{c_\alpha, w_\alpha} I_{emp}[f]

A number of observations need to be made here. First,
if the class T is small (typically in the sense of bounded
VC-dimension or bounded metric entropy [69]), then the
problem is not necessarily ill-posed and we do not have to
go through the process of using the sets H_n. However, as
has been mentioned already, for most interesting choices
of T (e.g. classes of functions in Sobolev spaces, continuous
functions, etc.) the problem might be ill-posed.
However, this might not be the only reason for using the
classes H_n. It might be the case that that is all we have,
or for some reason it is something we would like to use.
For example, one might want to use a particular class of
feed-forward networks because of ease of implementation
in VLSI. Also, if we were to solve the function learning
problem on a computer, as is typically done in practice,
then the functions in T have to be represented somehow.
We might consequently use H_n as a representation
scheme. It should be pointed out that the sets H_n and
T have to be matched with each other. For example,
we would hardly use polynomials as an approximation
scheme when the class T consists of indicator functions
or, for that matter, use threshold units when the class T
contains continuous functions. In particular, if we are to
recover the regression function, H must be dense in T.
One could look at this matching from both directions.
For a class T, one might be interested in an appropriate
choice of H_n. Conversely, for a particular choice of H_n,
one might ask what classes T can be effectively solved
with this scheme. Thus, if we were to use multilayer
perceptrons, this line of questioning would lead us to
identify the class of problems which can be effectively
solved by them.

^5 Notice that we are implicitly assuming that the problem
of minimizing I_{emp}[f] over H_n has a solution, which might not
be the case. However, the quantity

    E_{n,l} = \inf_{f \in H_n} I_{emp}[f]

is always well defined, and we can always find a function f_{n,l}
for which I_{emp}[f_{n,l}] is arbitrarily close to E_{n,l}. It will turn
out that this is sufficient for our purposes, and therefore we
will continue, assuming that f_{n,l} is well defined by eq. (5).

Thus, we see that in principle we would like to minimize
I[f] over the large class T, obtaining thereby the
regression function f_0. What we do in practice is to minimize
the empirical risk I_{emp}[f] over the smaller class H_n,
obtaining the function f_{n,l}. Assuming we have solved all
the computational problems related to the actual computation
of the estimator f_{n,l}, the main problem is now:

how good is f_{n,l}?

Independently of the measure of performance that we
choose when answering this question, we expect f_{n,l} to
become a better and better estimator as n and l go to
infinity. In fact, when l increases, our estimate of the expected
risk improves and our estimator improves. The
case of n is trickier. As n increases, we have more parameters
to model the regression function, and our estimator
should improve. However, at the same time, because we
have more parameters to estimate with the same amount
of data, our estimate of the expected risk deteriorates.
Thus we now need more data, and n and l have to grow
as a function of each other for convergence to occur.
At what rate and under what conditions the estimator
improves depends on the properties of the regression
function, that is on T, and on the approximation scheme
we are using, that is on H_n.

2.6  Bounding  the  Generalization  Error 

At this stage it might be worthwhile to review and remark
on some general features of the problem of learning
from examples. Let us remember that our goal is to minimize
the expected risk I[f] over the set T. If we were to
use a finite number of parameters, then we have already
seen that the best we could possibly do is to minimize
our functional over the set H_n, yielding the estimator
f_n:

    f_n = \arg\min_{f \in H_n} I[f].

However, not only is the parametrization limited, but
the data is also finite, and we can only minimize the
empirical risk I_{emp}, obtaining as our final estimate the
function f_{n,l}. Our goal is to bound the distance of
f_{n,l}, that is our solution, from f_0, that is the “optimal”
solution. If we choose to measure the distance in the
L^2(P) metric (see appendix A), the quantity that we
need to bound, which we will call the generalization error, is:

    E[(f_0 - f_{n,l})^2] = \int_X dx P(x) (f_0(x) - f_{n,l}(x))^2 = \| f_0 - f_{n,l} \|^2_{L^2(P)}

There are two main factors that contribute to the generalization
error, and we are going to analyze them separately
for the moment.

1. A first cause of error comes from the fact that
we are trying to approximate an infinite dimensional
object, the regression function f_0 ∈ T, with
a finite number of parameters. We call this error
the approximation error, and we measure it by
the quantity E[(f_0 − f_n)^2], that is the L^2(P) distance
between the best function in H_n and the regression
function. The approximation error can be
expressed in terms of the expected risk using the
decomposition (2) as

    E[(f_0 - f_n)^2] = I[f_n] - I[f_0].    (6)

Notice that the approximation error does not depend
on the data set D_l, but depends only on the
approximating power of the class H_n. The natural
framework to study it is approximation theory, which
abounds with bounds on the approximation error for
a variety of choices of H_n and T. In the following
we will always assume that it is possible to bound
the approximation error as follows:

    E[(f_0 - f_n)^2] \le \varepsilon(n)

where ε(n) is a function that goes to zero as n goes
to infinity if H is dense in T. In other words,
as shown in figure (1), as the number n of parameters
gets larger the representation capacity of
H_n increases, and allows a better and better approximation
of the regression function f_0. This issue
has been studied by a number of researchers
[18, 44, 6, 8, 30, 57, 56] in the neural networks community.

2. Another source of error comes from the fact that,
due to finite data, we minimize the empirical risk
I_{emp}[f], and obtain f_{n,l}, rather than minimizing
the expected risk I[f], and obtaining f_n. As the
number of data goes to infinity we hope that f_{n,l}
will converge to f_n, and convergence will take place
if the empirical risk converges to the expected risk
uniformly in probability [80]. The quantity

    |I_{emp}[f] - I[f]|

is called the estimation error, and conditions for the
estimation error to converge to zero uniformly in
probability have been investigated by Vapnik and
Chervonenkis [81, 82, 80, 83], Pollard [69], Dudley
[24], and Haussler [42]. Under a variety of different
hypotheses it is possible to prove that, with probability
1 − δ, a bound of this form is valid:

    |I_{emp}[f] - I[f]| \le \omega(l, n, \delta)    \forall f \in H_n    (7)

The specific form of ω depends on the setting of the
problem, but, in general, we expect ω(l, n, δ) to be
a decreasing function of l. However, we also expect
it to be an increasing function of n. The reason
is that, if the number of parameters is large, then
the expected risk is a very complex object, and
more data will be needed to estimate it. Therefore,
keeping fixed the number of data and increasing the
number of parameters will result, on the average,
in a larger distance between the expected risk and
the empirical risk.

The approximation and estimation errors are clearly two components of the generalization error, and it is interesting to notice that, as shown in the next statement, the generalization error can be bounded by the sum of the two:

Statement 2.1 The following inequality holds:

$$\|f_0 - \hat f_{n,l}\|^2_{L^2(P)} \le \varepsilon(n) + 2\omega(l, n, \delta) \qquad (8)$$

Proof: using the decomposition of the expected risk (2), the generalization error can be written as:

$$\|f_0 - \hat f_{n,l}\|^2_{L^2(P)} = E[(f_0 - \hat f_{n,l})^2] = I[\hat f_{n,l}] - I[f_0] \qquad (9)$$

A natural way of bounding the generalization error is as follows:

$$E[(f_0 - \hat f_{n,l})^2] \le |I[f_n] - I[f_0]| + |I[f_n] - I[\hat f_{n,l}]| \qquad (10)$$

In the first term of the right-hand side of the previous inequality we recognize the approximation error (6). If a bound of the form (7) is known for the estimation error, it is simple to show (see appendix (C)) that the second term can be bounded as

$$|I[f_n] - I[\hat f_{n,l}]| \le 2\omega(l, n, \delta)$$

and statement (2.1) follows. □
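The bound on the second term can be sketched with the standard three-term argument below. It is written under the assumption that (7) holds simultaneously for every $f \in H_n$ (so it applies to both $f_n$ and $\hat f_{n,l}$), and it may differ in detail from the proof given in appendix (C).

```latex
% f_n minimizes the expected risk I over H_n;
% \hat f_{n,l} minimizes the empirical risk I_emp over H_n.
\begin{align*}
0 \le I[\hat f_{n,l}] - I[f_n]
  &= \bigl(I[\hat f_{n,l}] - I_{\mathrm{emp}}[\hat f_{n,l}]\bigr)
   + \bigl(I_{\mathrm{emp}}[\hat f_{n,l}] - I_{\mathrm{emp}}[f_n]\bigr)
   + \bigl(I_{\mathrm{emp}}[f_n] - I[f_n]\bigr) \\
  &\le \omega(l,n,\delta) + 0 + \omega(l,n,\delta)
   = 2\,\omega(l,n,\delta).
\end{align*}
```

The first and third brackets are each bounded by (7), and the middle bracket is non-positive because $\hat f_{n,l}$ minimizes the empirical risk over $H_n$.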

Thus we see that the generalization error has two components: one, bounded by $\varepsilon(n)$, is related to the approximation power of the class of functions $\{H_n\}$, and is studied in the framework of approximation theory. The second, bounded by $\omega(l, n, \delta)$, is related to the difficulty of estimating the parameters given finite data, and is studied in the framework of statistics. Consequently, results from both these fields are needed in order to provide an understanding of the problem of learning from examples. Figure (1) also shows a picture of the problem.

2.7  A  Note  on  Models  and  Model  Complexity 

From the form of eq. (8) the reader will quickly realize that there is a trade-off between $n$ and $l$ for a certain generalization error. For a fixed $l$, as $n$ increases, the approximation error $\varepsilon(n)$ decreases but the estimation error $\omega(l, n, \delta)$ increases. Consequently, there is a certain $n$ which might optimally balance this trade-off. Note that the classes $H_n$ can be looked upon as models of increasing complexity and the search for an optimal $n$ amounts to a search for the right model complexity. One typically wishes to match the model complexity with the sample complexity (measured by how much data we have on hand) and this problem is well studied [29, 75, 52, 73, 4, 28, 17] in statistics.

Broadly speaking, simple models would have high approximation errors but small estimation errors while complex models would have low approximation errors but high estimation errors. This might be true even when considering qualitatively different models, and as an illustrative example let us consider two kinds of models we might use to learn regression functions in the space of bounded continuous functions. The class of linear models, i.e., the class of functions which can be expressed as $f = w \cdot x + \theta$, does not have much approximating power and consequently its approximation error is rather high. However, its estimation error is quite low. The class of models which can be expressed in the form $f = \sum_{i=1}^{n} c_i \sin(w_i \cdot x + \theta_i)$ has higher approximating power [47], resulting in low approximation errors. However, this class has an infinite VC-dimension [82] and its estimation error therefore cannot be bounded.

Figure 1: This figure shows a picture of the problem. The outermost circle represents the set $\mathcal{F}$. Embedded in this are the nested subsets, the $H_n$'s. $f_0$ is an arbitrary target function in $\mathcal{F}$, $f_n$ is the closest element of $H_n$ and $\hat f_{n,l}$ is the element of $H_n$ which the learner hypothesizes on the basis of data.

So far we have provided a very general characterization of this problem, without stating what the sets $\mathcal{F}$ and $H_n$ are. As we have already mentioned before, the set $\mathcal{F}$ could be a set of bounded differentiable or integrable functions, and $H_n$ could be polynomials of degree $n$, spline functions with $n$ knots, multilayer perceptrons with $n$ hidden units or any other parametric approximation scheme with $n$ parameters. In the next section we will consider a specific choice for these sets, and we will provide a bound on the generalization error of the form of eq. (8).

3  Stating  the  Problem  for  Radial  Basis 
Functions 

As mentioned before, the problem of learning from examples reduces to estimating some target function from a set $X$ to a set $Y$. In most practical cases, such as character recognition, motor control, and time series prediction, the set $X$ is the $k$-dimensional Euclidean space $R^k$, and the set $Y$ is some subset of the real line, that for our purposes we will assume to be the interval $[-M, M]$, where $M$ is some positive number. In fact, there is a probability distribution $P(x, y)$ defined on the space $R^k \times [-M, M]$ according to which the labelled examples are drawn independently at random, and from which we try to estimate the regression (target) function. It is clear that the regression function is a real function of $k$ variables.

In this paper we focus our attention on the Radial Basis Functions approximation scheme (also called Hyper-Basis Functions [67]). This is the class of approximating functions that can be written as:

$$f(x) = \sum_{i=1}^{n} \beta_i G(x - t_i)$$

where $G$ is some given basis function and the $\beta_i$ and the $t_i$ are free parameters. We would like to understand what classes of problems can be solved "well" by this technique, where "well" means that both approximation and estimation bounds need to be favorable. We will see later that a favorable approximation bound can be obtained if we assume that the class of functions $\mathcal{F}$ to which the regression function belongs is defined as follows:

$$\mathcal{F} \equiv \{ f \in L^2(R^k) \mid f = \lambda * G, \ |\lambda|_{R^k} \le M \} \qquad (11)$$

Here $\lambda$ is a signed Radon measure on the Borel sets of $R^k$, $G$ is a gaussian function with range in $[0, V]$, the symbol $*$ stands for the convolution operation, $|\lambda|_{R^k}$ is the total variation (see footnote) of the measure $\lambda$ and $M$ is a positive real number. We point out that the class $\mathcal{F}$ is non-trivial to learn in the sense that it has infinite pseudo-dimension [69].

In order to obtain an estimation bound we need the approximating class to have bounded variation, and the following constraint will be imposed:

$$\sum_{i=1}^{n} |\beta_i| \le M$$

We will see in the proof that this constraint does not affect the approximation bound, and the two pieces fit together nicely. Thus the set $H_n$ is defined now as the set of functions belonging to $L^2$ such that

$$f(x) = \sum_{i=1}^{n} \beta_i G(x - t_i), \quad \sum_{i=1}^{n} |\beta_i| \le M, \quad t_i \in R^k \qquad (12)$$

Having defined the sets $H_n$ and $\mathcal{F}$ we remind the reader that our goal is to recover the regression function, that is the minimum of the expected risk over $\mathcal{F}$. What we end up doing is to draw a set of $l$ examples and to minimize the empirical risk $I_{\mathrm{emp}}$ over the set $H_n$, that is to solve the following non-convex minimization problem:

$$\hat f_{n,l} \equiv \arg\min_{\beta_\alpha, t_\alpha} \sum_{i=1}^{l} \Bigl( y_i - \sum_{\alpha=1}^{n} \beta_\alpha G(x_i - t_\alpha) \Bigr)^2 \qquad (13)$$
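The objects appearing in eqs. (12) and (13) can be sketched in a few lines of plain Python. This is only an illustration: the function names, the list-based data layout and the fixed width `sigma` are assumptions made here, not part of the paper, and the sketch evaluates the objective of eq. (13) without attempting the non-convex minimization itself.

```python
import math

def gaussian(x, t, sigma=1.0):
    """Gaussian basis function G(x - t); sigma is an assumed fixed width."""
    sq_dist = sum((xj - tj) ** 2 for xj, tj in zip(x, t))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def rbf_network(x, betas, centers, sigma=1.0):
    """f(x) = sum_i beta_i * G(x - t_i), as in eq. (12)."""
    return sum(b * gaussian(x, t, sigma) for b, t in zip(betas, centers))

def satisfies_constraint(betas, M):
    """Check the constraint sum_i |beta_i| <= M that defines H_n."""
    return sum(abs(b) for b in betas) <= M

def empirical_risk(xs, ys, betas, centers, sigma=1.0):
    """Sum of squared residuals over the l examples, the objective of eq. (13)."""
    return sum((y - rbf_network(x, betas, centers, sigma)) ** 2
               for x, y in zip(xs, ys))

# Hypothetical usage: two centers in R^1 and three labelled examples.
xs = [[0.0], [1.0], [2.0]]
ys = [1.0, 0.5, 0.0]
betas, centers = [1.0, -0.5], [[0.0], [2.0]]
print(satisfies_constraint(betas, M=2.0), empirical_risk(xs, ys, betas, centers))
```

In practice the minimization over the coefficients and centers would be carried out by some iterative scheme; the functions above only evaluate a candidate solution.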

Notice that the assumption that the regression function

$$f_0(x) \equiv E[y|x]$$

belongs to the class $\mathcal{F}$ correspondingly implies an assumption on the probability distribution $P(y|x)$, viz., that $P$ must be such that $E[y|x]$ belongs to $\mathcal{F}$. Notice also that since we assumed that $Y$ is a closed interval, we are implicitly assuming that $P(y|x)$ has compact support.

Assuming now that we have been able to solve the minimization problem of eq. (13), the main question we are interested in is "how far is $\hat f_{n,l}$ from $f_0$?". We give an answer in the next section.

Footnote (total variation): A signed measure $\lambda$ can be decomposed by the Hahn-Jordan decomposition into $\lambda = \lambda^+ - \lambda^-$. Then $|\lambda| = \lambda^+ + \lambda^-$ is called the total variation of $\lambda$. See Dudley [23] for more information.

4  Main  Result 

The  main  theorem  is: 

Theorem 4.1 For any $0 < \delta < 1$, for $n$ nodes, $l$ data points, input dimensionality of $k$, and $H_n$, $\mathcal{F}$, $f_0$, $\hat f_{n,l}$ as defined in the statement of the problem above, with probability greater than $1 - \delta$,

$$\|f_0 - \hat f_{n,l}\|^2_{L^2(P)} \le O\left(\frac{1}{n}\right) + O\left(\left[\frac{nk \ln(nl) - \ln \delta}{l}\right]^{1/2}\right)$$

Proof: The proof requires us to go through a series of propositions and lemmas which have been relegated to appendix (D) for continuity of ideas. □

5  Remarks 

There are a number of comments we would like to make on the formulation of our problem and the result we have obtained. There is a vast body of literature on approximation theory and the theory of empirical risk minimization. In recent times, some of the results in these areas have been applied by the computer science and neural network community to study formal learning models. Here we would like to make certain observations about our result, suggest extensions and future work, and make connections with other work done in related areas.


5.1  Observations  on  the  Main  Result 

• The theorem has a PAC-like [79] setting. It tells us that if we draw enough data points (labelled examples) and have enough nodes in our Radial Basis Functions network, we can drive our error arbitrarily close to zero with arbitrarily high probability. Note however that our result is not entirely distribution-free. Although no assumptions are made on the form of the underlying distribution, we do have certain constraints on the kinds of distributions for which this result holds. In particular, the distribution is such that its conditional mean $E[y|x]$ (this is also the regression function $f_0(x)$) must belong to the class of functions $\mathcal{F}$ defined by eq. (11). Further, the distribution $P(y|x)$ must have compact support (see footnote).

Footnote (compact support): This condition, which is related to the problem of large deviations [80], could be relaxed, and will be the subject of further investigations.


• The error bound consists of two parts, one ($O(1/n)$) coming from approximation theory, and the other, $O\bigl([(nk \ln(nl) + \ln(1/\delta))/l]^{1/2}\bigr)$, from statistics. It is noteworthy that for a given approximation scheme (corresponding to $\{H_n\}$) a certain class of functions (corresponding to $\mathcal{F}$) suggests itself. So we have gone from the class of networks to the class of problems they can perform, as opposed to the other way around, i.e., from a class of problems to an optimal class of networks.

• This sort of a result implies that if we have the prior knowledge that $f_0$ belongs to class $\mathcal{F}$, then by choosing the number of data points, $l$, and the number of basis functions, $n$, appropriately, we can drive the misclassification error arbitrarily close to the Bayes rate. In fact, for a fixed amount of data, even before we have started looking at the data, we can pick a starting architecture, i.e., the number of nodes, $n$, for optimal performance. After looking at the data, we might be able to do some structural risk minimization [80] to further improve architecture selection. For a fixed architecture, this result sheds light on how much data is required for a certain error performance. Moreover, it allows us to choose the number of data points and number of nodes simultaneously for guaranteed error performances. Section 6 explores this question in greater detail.

5.2  Extensions 

• There are certain natural extensions to this work. We have essentially proved the consistency of the estimated network function $\hat f_{n,l}$. In particular we have shown that $\hat f_{n,l}$ converges to $f_0$ in probability as $l$ and $n$ grow to infinity. It is also possible to derive conditions for almost sure convergence. Further, we have looked at a specific class of networks ($\{H_n\}$) which consist of weighted sums of Gaussian basis functions with moving centers but fixed variance. This kind of an approximation scheme suggests a class of functions $\mathcal{F}$ which can be approximated with guaranteed rates of convergence as mentioned earlier. We could prove similar theorems for other kinds of basis functions which would have stronger approximation properties than the class of functions considered here. The general principle on which the proof is based can hopefully be extended to a variety of approximation schemes.

• We have used notions of metric entropy and covering number [69, 24] in obtaining our uniform convergence results. Haussler [42] uses the results of Pollard and Dudley to obtain uniform convergence results and our techniques closely follow his approach. It should be noted here that Vapnik [80] deals with exactly the same question and uses the VC-dimension instead. It would be interesting to compute the VC-dimension of the class of networks and use it to obtain our results.

• While we have obtained an upper bound on the error in terms of the number of nodes and examples, it would be worthwhile to obtain lower bounds on the same. Such lower bounds do not seem to exist in the neural network literature to the best of our knowledge.

• We have considered here a situation where the estimated network, i.e., $\hat f_{n,l}$, is obtained by minimizing the empirical risk over the class of functions $H_n$. Very often, the estimated network is obtained by minimizing a somewhat different objective function which consists of two parts. One is the fit to the data and the other is some complexity term which favours less complex (according to the defined notion of complexity) functions over more complex ones. For example the regularization approach [77, 68, 84] minimizes a cost function of the form

$$H[f] = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \Phi[f]$$

over the class $H = \bigcup_{n \ge 1} H_n$. Here $\lambda$ is the so-called "regularization parameter" and $\Phi[f]$ is a functional which measures smoothness of the functions involved. It would be interesting to obtain convergence conditions and rates for such schemes. Choice of an optimal $\lambda$ is an interesting question in regularization techniques and typically cross-validation or other heuristic schemes are used. A result on convergence rate potentially offers a principled way to choose $\lambda$.

• Structural risk minimization is another method to achieve a trade-off between network complexity (corresponding to $n$ in our case) and fit to data. However it does not guarantee that the architecture selected will be the one with minimal parametrization (see footnote). In fact, it would be of some interest to develop a sequential growing scheme. Such a technique would at any stage perform a sequential hypothesis test [37]. It would then decide whether to ask for more data, add one more node or simply stop and output the function it has as its $\varepsilon$-good hypothesis. In such a process, one might even incorporate active learning [2, 62] so that if the algorithm asks for more data, then it might even specify a region in the input domain from where it would like to see this data. It is conceivable that such a scheme would grow to minimal parametrization (or closer to it at any rate) and require less data than classical structural risk minimization.

• It should be noted here that we have assumed that the empirical risk $\sum_{i=1}^{l} (y_i - f(x_i))^2$ can be minimized over the class $H_n$ and the function $\hat f_{n,l}$ be effectively computed. While this might be fine in principle, in practice only a locally optimal solution to the minimization problem is found (typically using some gradient descent scheme). The computational complexity of obtaining even an approximate solution to the minimization problem is an interesting question, and results from computer science [49, 12] suggest that it might in general be NP-hard.

Footnote (minimal parametrization): Neither does regularization for that matter. The question of minimal parametrization is related to that of order determination of systems, a very difficult problem!

5.3 Connections with Other Results

• In the neural network and computational learning theory communities results have been obtained pertaining to the issues of generalization and learnability. Some theoretical work has been done [10, 42, 61] in characterizing the sample complexity of finite sized networks. Of these, it is worthwhile to mention again the work of Haussler [42] from which this paper derives much inspiration. He obtains bounds for a fixed hypothesis space, i.e., a fixed finite network architecture. Here we deal with families of hypothesis spaces, using richer and richer hypothesis spaces as more and more data becomes available. Later we will characterize the trade-off between hypothesis complexity and error rate. Others [27, 63] attempt to characterize the generalization abilities of feed-forward networks using theoretical formalizations from statistical mechanics. Yet others [13, 60, 16, 1] attempt to obtain empirical bounds on generalization abilities.

• This is an attempt to obtain rate-of-convergence bounds in the spirit of Barron's work [5], but using a different approach. We have chosen to combine theorems from approximation theory (which gives us the $O(1/n)$ term in the rate) and uniform convergence theory (which gives us the other part). Note that at this moment, our rate of convergence is worse than Barron's. In particular, he obtains a rate of convergence of $O(1/n + (nk \ln(l))/l)$. Further, he has a different set of assumptions on the class of functions (corresponding to our $\mathcal{F}$). Finally, the approximation scheme is a class of networks with sigmoidal units as opposed to radial-basis units and a different proof technique is used. It should be mentioned here that his proof relies on a discretization of the networks into a countable family, while no such assumption is made here.


• It would be worthwhile to make a reference to Geman's paper [31] which talks of the Bias-Variance dilemma. This is another way of formulating the trade-off between the approximation error and the estimation error. As the number of parameters (proportional to $n$) increases, the bias (which can be thought of as analogous to the approximation error) of the estimator decreases and its variance (which can be thought of as analogous to the estimation error) increases for a fixed size of the data set. Finding the right bias-variance trade-off is very similar in spirit to finding the trade-off between network complexity and data complexity.

• Given the class of radial basis functions we are using, a natural comparison arises with kernel regression [50, 22] and results on the convergence of kernel estimators. It should be pointed out that, unlike our scheme, Gaussian-kernel regressors require the variance of the Gaussian to go to zero as a function of the data. Further, the number of kernels is always equal to the number of data points and the issue of trade-off between the two is not explored to the same degree.

• In our statement of the problem, we discussed how pattern classification could be treated as a special case of regression. In this case the function $f_0$ corresponds to the Bayes a-posteriori decision function. Researchers [71, 45, 36] in the neural network community have observed that a network trained on a least square error criterion and used for pattern classification was in effect computing the Bayes decision function. This paper provides a rigorous proof of the conditions under which this is the case.

6 Implications of the Theorem in Practice: Putting In the Numbers

We have stated our main result in a particular form. We have provided a provable upper bound on the error (in the $\|\cdot\|_{L^2(P)}$ metric) in terms of the number of examples and the number of basis functions used. Further, we have provided the order of the convergence and have not stated the constants involved. The same result could be stated in other forms and has certain implications. It provides us rates at which the number of basis functions ($n$) should increase as a function of the number of examples ($l$) in order to guarantee convergence (Section 6.1). It also provides us with the trade-offs between the two as explored in Section 6.2.

6.1 Rate of Growth of n for Guaranteed Convergence

From our theorem (4.1) we see that the generalization error converges to zero only if $n$ goes to infinity more slowly than $l$. In fact, if $n$ grows too quickly the estimation error $\omega(l, n, \delta)$ will diverge, because it grows with $n$. In fact, setting $n = l^r$, we obtain

$$\lim_{l \to +\infty} \omega(l, n, \delta) = \lim_{l \to +\infty} O\left(\left[\frac{l^r k \ln(l^{r+1}) + \ln(1/\delta)}{l}\right]^{1/2}\right) = \lim_{l \to +\infty} O\left(\left[l^{r-1} \ln l\right]^{1/2}\right).$$

Therefore the condition $r < 1$ should hold in order to guarantee convergence to zero.
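The effect of the growth exponent can be checked numerically with a small sketch. It assumes that all hidden constants in the estimation term of theorem (4.1) are equal to 1, and the values of `k` and `delta` are arbitrary choices made here for illustration.

```python
import math

def omega(l, r, k=2, delta=0.05):
    """Estimation term of theorem (4.1) with n = l**r and all constants set to 1."""
    n = l ** r
    return math.sqrt((n * k * math.log(n * l) + math.log(1.0 / delta)) / l)

# Evaluate the term at l = 10^2, 10^4, 10^6 for a slow (r = 0.5)
# and a fast (r = 1) growth of the network size.
for r in (0.5, 1.0):
    print(r, [round(omega(10.0 ** p, r), 3) for p in (2, 4, 6)])
```

With $r = 0.5$ the printed values shrink as $l$ grows, while with $r = 1$ they keep increasing, in line with the condition $r < 1$.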

6.2  Optimal  Choice  of  n 

In the previous section we made the point that the number of parameters $n$ should grow more slowly than the number of data points $l$, in order to guarantee the consistency of the estimator $\hat f_{n,l}$. It is quite clear that there is an optimal rate of growth of the number of parameters, that, for any fixed amount of data points $l$, gives the best possible performance with the least number of parameters. In other words, for any fixed $l$ there is an optimal number of parameters $n^*(l)$ that minimizes the generalization error. That such a number should exist is quite intuitive: for a fixed number of data, a small number of parameters will give a low estimation error $\omega(l, n, \delta)$ but very high approximation error $\varepsilon(n)$, and therefore the generalization error will be high. If the number of parameters is very high the approximation error $\varepsilon(n)$ will be very small, but the estimation error $\omega(l, n, \delta)$ will be high, leading to a large generalization error again. Therefore, somewhere in between there should be a number of parameters high enough to make the approximation error small, but not too high, so that these parameters can be estimated reliably, with a small estimation error. This phenomenon is evident from figure (2), where we plotted the generalization error as a function of the number of parameters $n$ for various choices of sample size $l$. Notice that for a fixed sample size, the error passes through a minimum. Notice that the location of the minimum shifts to the right when the sample size is increased.
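The existence of such a minimum can be illustrated with a brute-force search over $n$. The sketch below is not the computation behind figure (2); it assumes that the hidden constants of theorem (4.1) are all 1, and the values of `k`, `delta` and the search limit `n_max` are hypothetical choices.

```python
import math

def bound(n, l, k=5, delta=0.05):
    """Right-hand side of theorem (4.1) with all hidden constants set to 1."""
    approximation = 1.0 / n
    estimation = math.sqrt((n * k * math.log(n * l) + math.log(1.0 / delta)) / l)
    return approximation + estimation

def best_n(l, k=5, delta=0.05, n_max=2000):
    """Brute-force search for the n in 1..n_max that minimizes the bound."""
    return min(range(1, n_max + 1), key=lambda n: bound(n, l, k, delta))

for l in (10**3, 10**4, 10**5):
    n_star = best_n(l)
    print(l, n_star, round(bound(n_star, l), 3))
```

Under these assumptions the minimizing $n$ grows with the sample size, which is the rightward shift of the minimum described above.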


Figure 2: Bound on the generalization error as a function of the number of basis functions $n$ keeping the sample size $l$ fixed. This has been plotted for a few different choices of sample size. Notice how the generalization error goes through a minimum for a certain value of $n$. This would be an appropriate choice for the given (constant) data complexity. Note also that the minimum is broader for larger $l$, that is, an accurate choice of $n$ is less critical when plenty of data is available.


In order to find out exactly what is the optimal rate of growth of the network size we simply find the minimum of the generalization error as a function of $n$ keeping the sample size $l$ fixed. Therefore we have to solve the equation:

$$\frac{\partial}{\partial n} E[(f_0 - \hat f_{n,l})^2] = 0$$

for $n$ as a function of $l$. Substituting the bound given in theorem (4.1) in the previous equation, and setting all the constants to 1 for simplicity, we obtain:

$$\frac{\partial}{\partial n} \left[ \frac{1}{n} + \left( \frac{nk \ln(nl) - \ln(\delta)}{l} \right)^{1/2} \right] = 0.$$

Performing the derivative, the expression above can be written as

$$\frac{1}{n^2} = \frac{1}{2} \left[ \frac{kn \ln(nl) - \ln \delta}{l} \right]^{-1/2} \frac{k}{l} \left[ \ln(nl) + 1 \right].$$

We now make the assumption that $l$ is big enough to let us perform the approximation $\ln(nl) + 1 \approx \ln(nl)$. Moreover, we assume that

$$\frac{1}{\delta} \ll (nl)^{nk}$$

in such a way that the term including $\delta$ in the equation above is negligible. After some algebra we therefore conclude that the optimal number of parameters $n^*(l)$ satisfies, for large $l$, the equation:

$$n^*(l) = \left[ \frac{4l}{k \ln(n^*(l)\, l)} \right]^{1/3}.$$

From this equation it is clear that $n^*$ is roughly proportional to a power of $l$, and therefore we can neglect the factor $n^*$ in the denominator of the previous equation, since it will only affect the result by a multiplicative constant. Therefore we conclude that the optimal number of parameters $n^*(l)$ for a given number of examples behaves as

$$n^*(l) \propto \left[ \frac{l}{k \ln l} \right]^{1/3} \qquad (14)$$

In order to show that this is indeed the optimal rate of growth we reported in figure (3) the generalization error as a function of the number of examples $l$ for different rates of growth of $n$, that is, setting $n = l^r$ for different values of $r$. Notice that the exponent $r = 1/3$, which is very similar to the optimal rate of eq. (14), performs better than larger ($r = 1/2$) and smaller ($r = 1/10$) exponents. While a fixed sample size suggests the scheme above for choosing an optimal network size, it is important to note that for a certain confidence rate ($\delta$) and for a fixed error rate ($\varepsilon$), there are various choices of $n$ and $l$ which are satisfactory. Fig. 4 shows $n$ as a function of $l$, in other words $(l, n)$ pairs which yield the same error rate with the same confidence.
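As a rough numerical check of the rate in eq. (14), one can compare a brute-force minimizer of the bound with the fixed point of the large-$l$ stationarity condition $n = [4l / (k \ln(nl))]^{1/3}$. The sketch assumes unit constants in theorem (4.1); the function names, the iteration count and the search limit are choices made here for illustration only.

```python
import math

def bound(n, l, k, delta):
    """Right-hand side of theorem (4.1) with all hidden constants set to 1."""
    return 1.0 / n + math.sqrt((n * k * math.log(n * l) + math.log(1.0 / delta)) / l)

def n_star_brute_force(l, k, delta, n_max=5000):
    """Integer n in 1..n_max minimizing the bound."""
    return min(range(1, n_max + 1), key=lambda n: bound(n, l, k, delta))

def n_star_fixed_point(l, k, iterations=50):
    """Iterate n = (4 l / (k ln(n l)))**(1/3), the large-l condition for n*(l)."""
    n = 1.0
    for _ in range(iterations):
        n = (4.0 * l / (k * math.log(n * l))) ** (1.0 / 3.0)
    return n

def n_star_rate(l, k):
    """Right-hand side of eq. (14), up to the unknown multiplicative constant."""
    return (l / (k * math.log(l))) ** (1.0 / 3.0)

k, delta = 5, 0.05
for l in (10**4, 10**5, 10**6):
    brute = n_star_brute_force(l, k, delta)
    print(l, brute, round(n_star_fixed_point(l, k), 1),
          round(brute / n_star_rate(l, k), 2))
```

The last printed column is the ratio between the brute-force minimizer and the rate of eq. (14); if the rate is right, this ratio should change only slowly with $l$.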

If data are expensive for us, we could operate in region A of the curve. If network size is expensive we could operate in region B of the curve. In particular the economics of trading off network and data complexity would yield a suitable point on this curve and thus would allow us to choose the right combination of $n$ and $l$ to solve our regression problem with the required accuracy and confidence.
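Pairs $(l, n)$ that meet a target error can be tabulated with a simple search. The sketch below again sets all constants of theorem (4.1) to 1; the target `eps`, the values of `k` and `delta`, and the cap `l_max` are hypothetical, and the bisection relies on the bound decreasing in $l$ for fixed $n$ (which holds here once $l \ge 3$).

```python
import math

def bound(n, l, k=5, delta=0.05):
    """Right-hand side of theorem (4.1) with all hidden constants set to 1."""
    return 1.0 / n + math.sqrt((n * k * math.log(n * l) + math.log(1.0 / delta)) / l)

def samples_needed(n, eps, k=5, delta=0.05, l_max=10**12):
    """Smallest l with bound(n, l) <= eps for a target eps < 1, or None."""
    if 1.0 / n >= eps:
        return None  # the approximation term alone already reaches eps
    hi = 2
    while bound(n, hi, k, delta) > eps:  # doubling phase
        hi *= 2
        if hi > l_max:
            return None
    lo = hi // 2
    while hi - lo > 1:  # bisection between a failing lo and a passing hi
        mid = (lo + hi) // 2
        if bound(n, mid, k, delta) <= eps:
            hi = mid
        else:
            lo = mid
    return hi

for n in (2, 3, 4, 6, 10, 20):
    print(n, samples_needed(n, eps=0.5))
```

Each printed pair is one point of an iso-error curve of the kind discussed above: small networks need very many examples (or cannot reach the target at all), and larger networks trade parameters against data.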


[Plot for Figure 3. Horizontal axis: $l$ (number of examples); one of the curves is labelled $n = l$.]

Figure 3: The bound on the generalization error as a function of the number of examples for different choices of the rate at which network size $n$ increases with sample size $l$. Notice that if $n = l$, then the estimator is not guaranteed to converge, i.e., the bound on the generalization error diverges. While this is a distribution-free upper bound, we need distribution-free lower bounds as well to make the stronger claim that $n = l$ will never converge.


Of course we could also plot the error as a function of data size $l$ for a fixed network size ($n$) and this has been done for various choices of $n$ in Fig. 5.

We see as expected that the error monotonically decreases as a function of $l$. However it asymptotically decreases not to the Bayes error rate but to some value above it (the approximation error) which depends upon the network complexity.

Figure 4: This figure shows various choices of $(l, n)$ which give the same generalization error. The x-axis has been plotted on a log scale. The interesting observation is that there are an infinite number of choices for number of basis functions and number of data points all of which would guarantee the same generalization error (in terms of its worst case bound).

Figure 5: The generalization error as a function of number of examples keeping the number of basis functions ($n$) fixed. This has been done for several choices of $n$. As the number of examples increases to infinity the generalization error asymptotes to a minimum which is not the Bayes error rate because of finite hypothesis complexity (finite $n$).

Finally  figure  (6)  shows  the  result  of  theorem  (4.1) 
in  a  3-dimensional  plot.  The  generalization  error,  the 
network  size,  and  the  sample  size  are  all  plotted  as  a 
function  of  each  other. 

7  Conclusion 

For the task of learning some unknown function from labelled examples where we have multiple hypothesis classes of varying complexity, choosing the class of right complexity and the appropriate hypothesis within that class poses an interesting problem. We have provided an analysis of the situation and the issues involved and in particular have tried to show how the hypothesis complexity, the sample complexity and the generalization error are related. We proved a theorem for a special set of hypothesis classes, the radial basis function networks, and we bound the generalization error for certain function learning tasks in terms of the number of parameters and the number of examples. This is equivalent to obtaining a bound on the rate at which the number of parameters must grow with respect to the number of examples for convergence to take place. Thus we use richer and richer hypothesis spaces as more and more data become available. We also see that there is a tradeoff between hypothesis complexity and generalization error for a certain fixed amount of data and our result allows us a principled way of choosing an appropriate hypothesis complexity (network architecture). The choice of an appropriate model for empirical data is a problem of long-standing interest in statistics and we provide connections between our work and other work in the field.

Figure 6: The generalization error, the number of examples ($l$) and the number of basis functions ($n$) as a function of each other.

Acknowledgments We are grateful to T. Poggio and B. Caprile for useful discussions and suggestions.

A Notations


• $A$: a set of functions defined on $S$ such that, for any $a \in A$,

$$0 \le a(\xi) \le U^2 \quad \forall \xi \in S .$$

• $A_\xi$: the restriction of $A$ to the data set, see eq. (22).

• $\mathcal{B}$: it will usually indicate the set of all possible $l$-dimensional Boolean vectors.

• $B$: a generic $\epsilon$-separated set in $S$.

• $C(\epsilon, A, d_{L^1})$: the metric capacity of a set $A$ endowed with the metric $d_{L^1(P)}$.

• $d(\cdot,\cdot)$: a metric on a generic metric space $S$.

• $d_{L^1}(\cdot,\cdot)$, $d_{L^1(P)}(\cdot,\cdot)$: $L^1$ metrics in vector spaces. The definition depends on the space on which the metric is defined ($k$-dimensional vectors, real valued functions, vector valued functions).

1. In a vector space $R^k$ we have

$$d_{L^1}(x, y) = \frac{1}{k} \sum_{\mu=1}^{k} |x^\mu - y^\mu|$$

where $x, y \in R^k$, and $x^\mu$ and $y^\mu$ denote their $\mu$-th components.

2. In an infinite dimensional space $\mathcal{F}$ of real valued functions in $k$ variables we have

$$d_{L^1(P)}(f, g) = \int_{R^k} |f(x) - g(x)| \, dP(x)$$

where $f, g \in \mathcal{F}$ and $dP(x)$ is a probability measure on $R^k$.

3. In an infinite dimensional space $\mathcal{F}$ of functions in $k$ variables with values in $R^n$ we have

$$d_{L^1(P)}(f, g) = \frac{1}{n} \sum_{i=1}^{n} \int_{R^k} |f_i(x) - g_i(x)| \, dP(x)$$

where $f(x) = (f_1(x), \ldots, f_i(x), \ldots, f_n(x))$ and $g(x) = (g_1(x), \ldots, g_i(x), \ldots, g_n(x))$ are elements of $\mathcal{F}$ and $dP(x)$ is a probability measure on $R^k$.

• $D_l$: it will always indicate a data set of $l$ points:

$$D_l = \{(x_i, y_i) \in X \times Y\}_{i=1}^{l} .$$

The points are drawn according to the probability distribution $P(x, y)$.

• $E[\cdot]$: it denotes the expected value with respect to the probability distribution $P(x, y)$. For example

$$I[f] = E[(y - f(x))^2] ,$$

and

$$\|f_0 - f\|^2_{L^2(P)} = E[(f_0(x) - f(x))^2] .$$


• $f$: a generic estimator, that is any function from $X$ to $Y$:

$$f : X \to Y .$$

• $f_0(x)$: the regression function, it is the conditional mean of the response given the predictor:

$$f_0(x) = \int_Y dy \, y \, P(y|x) .$$

It can also be defined as the function that minimizes the expected risk $I[f]$ in $\mathcal{U}$, that is

$$f_0(x) = \arg\min_{f \in \mathcal{U}} I[f] .$$

Whenever the response is obtained by sampling a function $h$ in the presence of zero mean noise, the regression function coincides with the sampled function $h$.

• $f_n$: it is the function that minimizes the expected risk $I[f]$ in $H_n$:

$$f_n = \arg\inf_{f \in H_n} I[f] .$$

Since

$$I[f] = \|f_0 - f\|^2_{L^2(P)} + I[f_0] ,$$

$f_n$ is also the best $L^2(P)$ approximation to the regression function in $H_n$ (see figure 1).

• $\hat{f}_{n,l}$: it is the function that minimizes the empirical risk $I_{emp}[f]$ in $H_n$:

$$\hat{f}_{n,l} = \arg\inf_{f \in H_n} I_{emp}[f] .$$

In the neural network language it is the output of the network after training has occurred.

• $\mathcal{F}$: the space of functions to which the regression function belongs, that is the space of functions we want to approximate:

$$\mathcal{F} : X \to Y$$

where $X \subseteq R^k$ and $Y \subseteq R$. $\mathcal{F}$ could be for example a set of differentiable functions, or some Sobolev space $H^{m,p}(R^k)$.

• $\mathcal{G}$: it is a class of functions of $k$ variables

$$g : R^k \to [0, V]$$

defined as

$$\mathcal{G} = \{g : g(x) = G(\|x - t\|), \; t \in R^k\}$$

where $G$ is the gaussian function.

• $G_1$: it is a $(k+2)$-dimensional vector space of functions from $R^k$ to $R$ defined as

$$G_1 = \mathrm{span}\{1, x^1, x^2, \ldots, x^k, \|x\|^2\}$$

where $x \in R^k$ and $x^\mu$ is the $\mu$-th component of the vector $x$.


• $G_2$: it is a set of real valued functions in $k$ variables defined as

$$G_2 = \{\alpha e^{-f} : f \in G_1, \; \alpha = \tfrac{1}{\sqrt{2\pi}\,\sigma}\}$$

where $\sigma$ is the standard deviation of the Gaussian $G$.

• $H_I$: it is a class of vector valued functions

$$g(x) : R^k \to R^n$$

of the form

$$g(x) = (G(\|x - t_1\|), G(\|x - t_2\|), \ldots, G(\|x - t_n\|))$$

where $G$ is the gaussian function and the $t_j$ are arbitrary $k$-dimensional vectors.

• $H_F$: it is a class of real valued functions in $n$ variables:

$$f : [0, V]^n \to R$$

of the form

$$f(x) = \beta \cdot x$$

where $\beta = (\beta_1, \ldots, \beta_n)$ is an arbitrary $n$-dimensional vector that satisfies the constraint

$$\sum_{i=1}^{n} |\beta_i| \le M .$$

•  Hn'.  a  subset  of  whose  elements  are 
parametrized by a number of parameters proportional to $n$. We will assume that the sets $H_n$ form a nested family, that is

$$H_1 \subseteq H_2 \subseteq \ldots \subseteq H_n \subseteq \ldots .$$

For example $H_n$ could be the set of polynomials in one variable of degree $n - 1$, Radial Basis Functions with $n$ centers or multilayer perceptrons with $n$ hidden units. Notice that for Radial Basis Functions with moving centers and multilayer perceptrons the number of parameters of an element of $H_n$ is not $n$, but it is proportional to $n$ (respectively $n(k + 1)$ and $n(k + 2)$, where $k$ is the number of variables).

• $H$: it is defined as $H = \bigcup_{n=1}^{\infty} H_n$, and it is identified with the approximation scheme. If $H_n$ is the set of polynomials in one variable of degree $n - 1$, $H$ is the set of polynomials of any degree.

• $H^{m,p}(R^k)$: the Sobolev space of functions in $k$ variables whose derivatives up to order $m$ are in $L^p(R^k)$.

• $I[f]$: the expected risk, defined as

$$I[f] = \int_{X \times Y} dx \, dy \, P(x, y)(y - f(x))^2$$

where $f$ is any function for which this expression is well defined. It is a measure of how well the function $f$ predicts the response $y$.

• $I_{emp}[f]$: the empirical risk. It is a functional on $\mathcal{U}$ defined as

$$I_{emp}[f] = \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2$$

where $\{(x_i, y_i)\}_{i=1}^{l}$ is a set of data randomly drawn from $X \times Y$ according to the probability distribution $P(x, y)$. It is an approximate measure of the expected risk, since it converges to $I[f]$ in probability when the number of data points $l$ tends to infinity.

• $k$: it will always indicate the number of independent variables, and therefore the dimensionality of the set $X$.

• $l$: it will always indicate the number of data points drawn from $X$ according to the probability distribution $P(x)$.

• $L^2(P)$: the set of functions whose square is integrable with respect to the measure defined by the probability distribution $P$. The norm in $L^2(P)$ is therefore defined by

$$\|f\|^2_{L^2(P)} = \int dx \, P(x) f^2(x) .$$

• $\Lambda^m(R^k)(M_0, M_1, M_2, \ldots, M_m)$: the space of functions in $k$ variables whose derivatives up to order $m$ are bounded:

$$|D^\alpha f| \le M_{|\alpha|}, \quad |\alpha| = 0, 1, \ldots, m$$

where $\alpha$ is a multi-index.

• $M$: a bound on the coefficients of the gaussian Radial Basis Functions technique considered in this paper, see eq. (12).

• $\mathcal{M}(\epsilon, S, d)$: the packing number of the set $S$, with metric $d$.

• $\mathcal{N}(\epsilon, S, d)$: the covering number of the set $S$, with metric $d$.

• $n$: a positive number proportional to the number of parameters of the approximating function. Usually it will be the number of basis functions for the RBF technique or the number of hidden units for a multilayer perceptron.

• $P(x)$: a probability distribution defined on $X$. It is the probability distribution according to which the data are drawn from $X$.

• $P(y|x)$: the conditional probability of the response $y$ given the predictor $x$. It represents the probabilistic dependence of $y$ on $x$. If there is no noise in the system it has the form $P(y|x) = \delta(y - h(x))$, for some function $h$, indicating that the predictor $x$ uniquely determines the response $y$.

• $P(x, y)$: the joint distribution of the predictors and the response. It is a probability distribution on $X \times Y$ and has the form

$$P(x, y) = P(x) P(y|x) .$$

• $\mathcal{S}$: it will usually denote a metric space, endowed with a metric $d$.

• $S$: a generic subset of a metric space $\mathcal{S}$.

• $T$: a generic $\epsilon$-cover of a subset $S \subseteq \mathcal{S}$.

• $U$: it gives a bound on the elements of the class $A$. In the specific case of the class $A$ considered in the proof we have $U = M + MV$.

• $\mathcal{U}$: the set of all the functions from $X$ to $Y$ for which the expected risk is well defined.

• $V$: a bound on the Gaussian basis function $G$:

$$0 \le G(x) \le V , \quad \forall x \in R^k .$$

• $X$: a subset of $R^k$, not necessarily proper. It is the set of the independent variables, or predictors, or, in the language of neural networks, input variables.

• $x$: a generic element of $X$, and therefore a $k$-dimensional vector (in the neural network language it is the input vector).

• $Y$: a subset of $R$, whose elements represent the response variable, which in the neural network language is the output of the network. Unless otherwise stated it will be assumed to be compact, implying that $\mathcal{F}$ is a set of bounded functions. In pattern recognition problems it is simply the set $\{0, 1\}$.

• $y$: a generic element of $Y$; it denotes the response variable.

B  A  Useful  Decomposition  of  the 
Expected  Risk 

We now show that the function that minimizes the expected risk

$$I[f] = \int_{X \times Y} P(x, y) \, dx \, dy \, (y - f(x))^2$$

is the regression function defined in eq. (3). It is sufficient to add and subtract the regression function in the definition of expected risk:

$$I[f] = \int_{X \times Y} dx \, dy \, P(x, y)(y - f_0(x) + f_0(x) - f(x))^2$$
$$= \int_{X \times Y} dx \, dy \, P(x, y)(y - f_0(x))^2 + \int_{X \times Y} dx \, dy \, P(x, y)(f_0(x) - f(x))^2 + 2 \int_{X \times Y} dx \, dy \, P(x, y)(y - f_0(x))(f_0(x) - f(x)) .$$

By definition of the regression function $f_0(x)$, the cross product in the last equation is easily seen to be zero, and therefore

$$I[f] = \int_X dx \, P(x)(f_0(x) - f(x))^2 + I[f_0] .$$

Since the last term of $I[f]$ does not depend on $f$, the minimum is achieved when the first term is minimum, that is when $f(x) = f_0(x)$.
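The decomposition above can be checked exactly on a small discrete example. The sketch below is only an illustration: the joint distribution `P` and the helper names are hypothetical and not taken from the paper, and integrals are replaced by finite sums with exact rational arithmetic.

```python
from fractions import Fraction as Fr

# Hypothetical toy joint distribution P(x, y) on a finite grid (an assumption
# made for illustration only; it does not come from the paper).
P = {
    (0, 0): Fr(1, 8), (0, 1): Fr(3, 8),
    (1, 0): Fr(1, 4), (1, 2): Fr(1, 4),
}

def marginal_x(P):
    """P(x) = sum_y P(x, y)."""
    px = {}
    for (x, _), p in P.items():
        px[x] = px.get(x, 0) + p
    return px

def regression_function(P):
    """f0(x) = sum_y y P(y|x), the conditional mean of y given x."""
    px = marginal_x(P)
    f0 = {x: Fr(0) for x in px}
    for (x, y), p in P.items():
        f0[x] += y * p / px[x]
    return f0

def expected_risk(P, f):
    """I[f] = sum_{x,y} P(x, y) (y - f(x))^2."""
    return sum(p * (y - f[x]) ** 2 for (x, y), p in P.items())

def decomposition_gap(P, f):
    """I[f] - (E[(f0 - f)^2] + I[f0]); this is zero when the decomposition holds."""
    px = marginal_x(P)
    f0 = regression_function(P)
    approx = sum(px[x] * (f0[x] - f[x]) ** 2 for x in px)
    return expected_risk(P, f) - (approx + expected_risk(P, f0))
```

Calling `decomposition_gap(P, f)` for any estimator `f` given as a dictionary over the values of x should return exactly zero, which mirrors the vanishing of the cross term.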


Recall that $P(x, y) = P(x) P(y|x)$.


In the case in which the data come from randomly sampling a function $f$ in the presence of additive noise, $\epsilon$, with probability distribution $\nu(\epsilon)$ and zero mean, we have $P(y|x) = \nu(y - f(x))$ and then

$$I[f_0] = \int_{X \times Y} dx \, dy \, P(x, y)(y - f_0(x))^2 \qquad (15)$$
$$= \int_X dx \, P(x) \int_Y dy \, (y - f(x))^2 \, \nu(y - f(x)) \qquad (16)$$
$$= \int_X dx \, P(x) \int \epsilon^2 \nu(\epsilon) \, d\epsilon = \sigma^2 \qquad (17)$$

where $\sigma^2$ is the variance of the noise. When data are noisy, therefore, even in the most favourable case we cannot expect the expected risk to be smaller than the variance of the noise.

C  A  Useful  Inequality 

Let us assume that, with probability $1 - \delta$, a uniform bound has been established:

$$|I_{emp}[f] - I[f]| \le \omega(l, n, \delta) \quad \forall f \in H_n .$$

We want to prove that the following inequality also holds:

$$|I[f_n] - I[\hat{f}_{n,l}]| \le 2\omega(l, n, \delta) . \qquad (18)$$

This fact is easily established by noting that since the bound above is uniform, it holds for both $f_n$ and $\hat{f}_{n,l}$, and therefore the following inequalities hold:

$$I[\hat{f}_{n,l}] \le I_{emp}[\hat{f}_{n,l}] + \omega$$
$$I_{emp}[f_n] \le I[f_n] + \omega$$

Moreover, by definition, the two following inequalities also hold:

$$I[f_n] \le I[\hat{f}_{n,l}]$$
$$I_{emp}[\hat{f}_{n,l}] \le I_{emp}[f_n]$$

Therefore the following chain of inequalities holds, proving inequality (18):

$$I[f_n] \le I[\hat{f}_{n,l}] \le I_{emp}[\hat{f}_{n,l}] + \omega \le I_{emp}[f_n] + \omega \le I[f_n] + 2\omega .$$

An intuitive explanation of these inequalities is also given in figure (7).
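The chain of inequalities can be illustrated numerically. The following is a minimal sketch under stated assumptions: the hypothesis class (five constant predictors), the finite population standing in for P(x, y), and all function names are hypothetical. Here omega is taken to be the largest deviation between empirical and true risk over the class, so the bound holds for every sample rather than only with probability 1 - delta.

```python
import random

def risk(h, pairs):
    """Mean squared error of predictor h over a list of (x, y) pairs."""
    return sum((y - h(x)) ** 2 for x, y in pairs) / len(pairs)

def check_two_omega(seed=0, l=20):
    """Return (I[f_hat] - I[f_n], omega) for one random sample of size l."""
    rng = random.Random(seed)
    # Hypothetical finite hypothesis class: constant predictors h_c(x) = c.
    hypotheses = [lambda x, c=c: c for c in (0.0, 0.25, 0.5, 0.75, 1.0)]
    # Finite population sampled uniformly, so the true risk is an exact average.
    population = [(x, y) for x in range(4) for y in (0.0, 1.0, 1.0)]
    true_risk = [risk(h, population) for h in hypotheses]
    data = [rng.choice(population) for _ in range(l)]
    emp_risk = [risk(h, data) for h in hypotheses]
    # Uniform deviation over the class (plays the role of omega).
    omega = max(abs(e - t) for e, t in zip(emp_risk, true_risk))
    best_true = min(range(len(hypotheses)), key=lambda i: true_risk[i])  # f_n
    best_emp = min(range(len(hypotheses)), key=lambda i: emp_risk[i])    # f_hat
    return true_risk[best_emp] - true_risk[best_true], omega
```

For every seed, the returned gap should lie between 0 and twice the returned omega, as inequality (18) states.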

Figure 7: If the distance between $I[f_n]$ and $I[\hat{f}_{n,l}]$ is larger than $2\omega$, the condition $I_{emp}[\hat{f}_{n,l}] \le I_{emp}[f_n]$ is violated.

D  Proof  of  the  Main  Theorem

The theorem will be proved in a series of steps. For clarity of presentation we have divided the proof into four parts. The first takes the original problem and breaks it into its approximation and estimation components. The second and third parts are devoted to obtaining bounds for these two components respectively. The fourth and final part comes back to the original problem, reassembles its components and proves our main result. New definitions and notation will be introduced as and when the necessity arises.

We have seen in section 2 (statement 2.1) that the generalization error can be bounded, with probability $1 - \delta$, as follows:

$$\|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} \le \epsilon(n) + 2\omega(l, n, \delta) . \qquad (19)$$

In the next parts we will derive specific expressions for the approximation error $\epsilon$ and for the estimation error $\omega$ in order to prove theorem (4.1).

D.1  Bounding the approximation error

In this part we attempt to bound the approximation error. In section 3 we assumed that the class of functions to which the regression function belongs, that is the class of functions that we want to approximate, is

$$\mathcal{F} = \{f \in L^2(R^k) \mid f = \lambda * G, \; |\lambda|_{R^k} \le M\} ,$$

where $\lambda$ is a signed Radon measure on the Borel sets of $R^k$, $G$ is a gaussian function with range $[0, V]$, the symbol $*$ stands for the convolution operation, $|\lambda|_{R^k}$ is the total variation of the measure $\lambda$ and $M$ is a positive real number. Our approximating family is the class:

$$H_n = \{f \in L^2 \mid f = \sum_{i=1}^{n} \beta_i G(\|x - t_i\|), \; \sum_{i=1}^{n} |\beta_i| \le M, \; t_i \in R^k\} .$$

It has been shown in [33, 34] that the class $H_n$ uniformly approximates elements of $\mathcal{F}$, and that the following bound is valid:

$$E[(f_0 - f_n)^2] \le O\left(\frac{1}{n}\right) . \qquad (20)$$

This result is based on a lemma by Jones [48] on the convergence rate of an iterative approximation scheme in Hilbert spaces. A formally similar lemma, brought to our attention by R. Dudley [25], is due to Maurey and was published by Pisier [65]. Here we report a version of the lemma due to Barron [6, 7] that contains a slight refinement of Jones' result:

Lemma D.1 (Maurey-Jones-Barron) If $f$ is in the closure of the convex hull of a set $\mathcal{G}$ in a Hilbert space $H$ with $\|g\| \le b$ for each $g \in \mathcal{G}$, then for every $n \ge 1$ and for $c > b^2 - \|f\|^2$ there is an $f_n$ in the convex hull of $n$ points in $\mathcal{G}$ such that

$$\|f - f_n\|^2 \le \frac{c}{n} .$$

In order to exploit this result one needs to define suitable classes of functions which are the closure of the convex hull of some subset $\mathcal{G}$ of a Hilbert space $H$. One way to approach the problem consists in utilizing the integral representation of functions. Suppose that the functions in a Hilbert space $H$ can be represented by the integral

$$f(x) = \int_{\mathcal{M}} G_t(x) \, d\alpha(t) \qquad (21)$$

where $d\alpha$ is some measure on the parameter set $\mathcal{M}$, and $G_t(x)$ is a function of $H$ parametrized by the parameter $t$, whose norm $\|G_t(x)\|$ is bounded by the same number for any value of $t$. If $d\alpha$ is a finite measure, the integral (21) can be seen as an infinite convex combination, and therefore, applying lemma (D.1), one can prove that there exist $n$ coefficients $c_i$ and $n$ parameter vectors $t_i$ such that

$$\left\| f - \sum_{i=1}^{n} c_i G_{t_i}(x) \right\|^2 \le O\left(\frac{1}{n}\right) .$$

For the class $\mathcal{F}$ we consider, it is clear that functions in this class have an integral representation of the type (21) in which $G_t(x) = G(x - t)$, and the work in [33, 34] shows how to apply lemma (D.1) to this class.
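The O(1/n) rate of lemma (D.1) can be observed with a simple greedy scheme in a finite dimensional Hilbert space. The sketch below is an illustration under assumptions, not the construction used in [33, 34]: the set of atoms is a finite random subset of R^5, the target is an exact convex combination of them, and the update f_n = (1 - 1/n) f_{n-1} + (1/n) g with a greedy choice of g is one standard way of realizing the iterative scheme. All names are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def greedy_convex_approx(f, atoms, n_terms):
    """Build f_n = (1 - 1/n) f_{n-1} + (1/n) g, choosing g among the atoms greedily.

    Each f_n is a convex combination of at most n atoms (repetitions allowed).
    Returns the squared errors ||f - f_n||^2 for n = 1..n_terms.
    """
    fn = [0.0] * len(f)
    errors = []
    for n in range(1, n_terms + 1):
        a = 1.0 / n
        candidates = [[(1 - a) * p + a * q for p, q in zip(fn, g)] for g in atoms]
        fn = min(candidates, key=lambda c: sq_dist(f, c))
        errors.append(sq_dist(f, fn))
    return errors

def demo(seed=0, dim=5, n_atoms=30, n_terms=50):
    """Return (errors, c) with c = b^2 - ||f||^2 for a random convex combination f."""
    rng = random.Random(seed)
    atoms = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_atoms)]
    w = [rng.random() for _ in atoms]
    total = sum(w)
    w = [x / total for x in w]
    f = [sum(wi * g[j] for wi, g in zip(w, atoms)) for j in range(dim)]
    b2 = max(dot(g, g) for g in atoms)   # b^2, with ||g|| <= b for every atom
    return greedy_convex_approx(f, atoms, n_terms), b2 - dot(f, f)
```

With `errors, c = demo()`, each `errors[n - 1]` should be at most `c / n`, which is the bound of the lemma for a target lying exactly in the convex hull.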

Notice  that  the  bound  (20),  that  is  similar  in  spirit  to 
the  result  of  A.  Barron  on  multilayer  perceptrons  [6,  8], 
is  interesting  because  the  rate  of  convergence  does  not 
depend  on  the  dimension  d  of  the  input  space.  This  is 
apparently  unusual  in  approximation  theory,  because  it 
is  known,  from  the  theory  of  linear  and  nonlinear  widths 
[78,  64,  54,  55,  20,  19,  21,  56],  that,  if  the  function  that 
has  to  be  approximated  has  d  variables  and  a  degree  of 
smoothness s, we should not expect to find an approximation technique whose approximation error goes to zero faster than $O(n^{-s/d})$. Here “degree of smoothness” is a
measure  of  how  constrained  the  class  of  functions  we  con¬ 
sider  is,  for  example  the  number  of  derivatives  that  are 
uniformly  bounded,  or  the  number  of  derivatives  that  are 
integrable  or  square  integrable.  Therefore,  from  classi¬ 
cal  approximation  theory,  we  expect  that,  unless  certain 
constraints  are  imposed  on  the  class  of  functions  to  be 
approximated,  the  rate  of  convergence  will  dramatically 
slow  down  as  the  number  of  dimensions  increases,  show¬ 
ing  the  phenomenon  known  as  “the  curse  of  dimension¬ 
ality”  [11]. 
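To make the contrast concrete, one can compare how many basis functions each rate would require to reach a target error. This is only a rough sketch: constants are ignored, both rates are treated purely as functions of n, and the function names are hypothetical.

```python
import math

def units_needed_classical(eps, d, s):
    """Smallest n with n**(-s/d) <= eps, up to floating-point rounding."""
    return math.ceil(eps ** (-d / s))

def units_needed_dimension_free(eps):
    """Smallest n with 1/n <= eps, up to floating-point rounding."""
    return math.ceil(1.0 / eps)
```

For a fixed smoothness s, the first count grows exponentially with the dimension d, while the second does not depend on d at all.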

In the case of the class $\mathcal{F}$ we consider here, the constraint of considering functions that are convolutions of Radon measures with Gaussians seems to impose on this class of functions an amount of smoothness that is sufficient to guarantee that the rate of convergence does not become slower and slower as the dimension increases. A longer discussion of the “curse of dimensionality” can be found in [34].

We notice also that, since the rate (20) is independent of the dimension, the class $\mathcal{F}$, together with the approximating class $H_n$, defines a class of problems that are “tractable” even in a high number of dimensions.

D.2  Bounding  the  estimation  error 

In this part we attempt to bound the estimation error $|I[f] - I_{emp}[f]|$. In order to do that we first need to introduce some basic concepts and notations.

Let $S$ be a subset of a metric space $\mathcal{S}$ with metric $d$. We say that an $\epsilon$-cover with respect to the metric $d$ is a set $T \subseteq \mathcal{S}$ such that for every $s \in S$, there exists some $t \in T$ satisfying $d(s, t) \le \epsilon$. The size of the smallest $\epsilon$-cover is $\mathcal{N}(\epsilon, S, d)$ and is called the covering number of $S$. In other words

$$\mathcal{N}(\epsilon, S, d) = \min_{T \subseteq \mathcal{S}} |T| ,$$

where $T$ runs over all the possible $\epsilon$-covers of $S$ and $|T|$ denotes the cardinality of $T$.

A set $B$ belonging to the metric space $\mathcal{S}$ is said to be $\epsilon$-separated if for all distinct $x, y \in B$, $d(x, y) > \epsilon$. We define the packing number $\mathcal{M}(\epsilon, S, d)$ as the size of the largest $\epsilon$-separated subset of $S$. Thus

$$\mathcal{M}(\epsilon, S, d) = \max_{B \subseteq S} |B| ,$$

where $B$ runs over all the $\epsilon$-separated subsets of $S$. It is easy to show that the covering number is never larger than the packing number, that is $\mathcal{N}(\epsilon, S, d) \le \mathcal{M}(\epsilon, S, d)$.
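The reason the covering number cannot exceed the packing number is that a separated set which cannot be enlarged is automatically a cover: any point left out must lie within the separation radius of some point that was kept. The sketch below illustrates this on a finite point set; the greedy construction, the normalized L1 metric and all names are illustrative assumptions.

```python
import random

def l1(x, y):
    """Normalized L1 metric (1/k) * sum_mu |x_mu - y_mu|."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def greedy_separated_set(points, eps, dist):
    """Scan the points once, keeping a point iff it is > eps from all kept points."""
    kept = []
    for p in points:
        if all(dist(p, q) > eps for q in kept):
            kept.append(p)
    return kept

def is_separated(B, eps, dist):
    """True if all distinct pairs in B are more than eps apart."""
    return all(dist(B[i], B[j]) > eps
               for i in range(len(B)) for j in range(i + 1, len(B)))

def is_cover(T, points, eps, dist):
    """True if every point is within eps of some element of T."""
    return all(any(dist(p, t) <= eps for t in T) for p in points)

def random_points(n, k, seed=0):
    """n pseudo-random points in the unit cube of R^k."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(k)) for _ in range(n)]
```

For example, `B = greedy_separated_set(random_points(200, 3), 0.2, l1)` gives a set that is both 0.2-separated and a 0.2-cover of the sample, so its size is at least the covering number and at most the packing number of that sample.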

Let now $P(\xi)$ be a probability distribution defined on $S$, and $A$ be a set of real-valued functions defined on $S$ such that, for any $a \in A$,

$$0 \le a(\xi) \le U^2 \quad \forall \xi \in S .$$

Let also $\xi = (\xi_1, \ldots, \xi_l)$ be a sequence of $l$ examples drawn independently from $S$ according to $P(\xi)$. For any function $a \in A$ we define the empirical and true expectations of $a$ as follows:

$$\hat{E}[a] = \frac{1}{l} \sum_{i=1}^{l} a(\xi_i)$$

$$E[a] = \int_S d\xi \, P(\xi) a(\xi)$$

The difference between the empirical and true expectation can be bounded by the following inequality, whose proof can be found in [69] and [42], that will be crucial in order to prove our main theorem.

Claim D.1 ([69], [42]) Let $A$ and $\xi$ be as defined above. Then, for all $\epsilon > 0$,

$$P\left(\exists a \in A : |\hat{E}[a] - E[a]| > \epsilon\right) \le 4 E\left[\mathcal{N}(\epsilon/16, A_\xi, d_{L^1})\right] e^{-\frac{1}{128 U^4} \epsilon^2 l} .$$

In the above result, $A_\xi$ is the restriction of $A$ to the data set, that is:

$$A_\xi = \{(a(\xi_1), \ldots, a(\xi_l)) : a \in A\} . \qquad (22)$$

The set $A_\xi$ is a collection of points belonging to the subset $[0, U^2]^l$ of the $l$-dimensional euclidean space. Each function $a$ in $A$ is represented by a point in $A_\xi$, while every point in $A_\xi$ represents all the functions that have the same values at the points $\xi_1, \ldots, \xi_l$. The distance metric $d_{L^1}$ in the inequality above is the standard $L^1$ metric in $R^l$, that is

$$d_{L^1}(x, y) = \frac{1}{l} \sum_{\mu=1}^{l} |x^\mu - y^\mu|$$

where $x$ and $y$ are points in the $l$-dimensional euclidean space and $x^\mu$ and $y^\mu$ are their $\mu$-th components respectively.

The  above  inequality  is  a  result  in  the  theory  of  uni¬ 
form  convergence  of  empirical  measures  to  their  under¬ 
lying  probabilities,  that  has  been  studied  in  great  detail 
by  Pollard  and  Vapnik,  and  similar  inequalities  can  be 
found  in  the  work  of  Vapnik  [81,  82,  80],  although  they 
usually  involve  the  VC  dimension  of  the  set  A,  rather 
than  its  covering  numbers. 

Suppose now we choose $S = X \times Y$, where $X$ is an arbitrary subset of $R^k$ and $Y = [-M, M]$ as in the formulation of our original problem. The generic element of $S$ will be written as $\xi = (x, y) \in X \times Y$. We now consider the class of functions $A$ defined as:

$$A = \{a : X \times Y \to R \mid a(x, y) = (y - h(x))^2, \; h \in H_n\}$$

where $H_n$ is the class of $k$-dimensional Radial Basis Functions with $n$ basis functions defined in eq. (12) in section 3. Clearly,

$$|y - h(x)| \le |y| + |h(x)| \le M + MV ,$$

and therefore

$$0 \le a \le U^2$$

where we have defined

$$U = M + MV .$$

We notice that, by definition of $\hat{E}(a)$ and $E(a)$, we have

$$\hat{E}(a) = \frac{1}{l} \sum_{i=1}^{l} (y_i - h(x_i))^2 = I_{emp}[h]$$

and

$$E(a) = \int_{X \times Y} dx \, dy \, P(x, y)(y - h(x))^2 = I[h] .$$

Therefore, applying the inequality of claim D.1 to the set $A$, and noticing that the elements of $A$ are essentially defined by the elements of $H_n$, we obtain the following result:

$$P\left(\forall h \in H_n, \; |I_{emp}[h] - I[h]| \le \epsilon\right) \ge 1 - 4 E\left[\mathcal{N}(\epsilon/16, A_\xi, d_{L^1})\right] e^{-\frac{1}{128 U^4} \epsilon^2 l} , \qquad (23)$$

so  that  the  inequality  of  claim  D.l  gives  us  a  bound  on 
the  estimation  error.  However,  this  bound  depends  on 
the  specific  choice  of  the  probability  distribution  P(x,  y), 
while  we  are  interested  in  bounds  that  do  not  depend  on 
P.  Therefore  it  is  useful  to  define  some  quantity  that 
does  not  depend  on  P,  and  give  bounds  in  terms  of  that. 
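A bound of this exponential form can be inverted to ask how many examples suffice for a given accuracy and confidence. The sketch below is hedged in two ways: the covering-number term is replaced by a hypothetical known constant `covering`, and the constants 4 and 128 are assumptions read from the inequality above, exposed through the parameter `c` so they can be changed. The function names are illustrative.

```python
import math

def failure_probability(eps, l, covering, U, c=128.0):
    """Value of 4 * covering * exp(-eps^2 * l / (c * U^4)), capped at 1."""
    return min(1.0, 4.0 * covering * math.exp(-(eps ** 2) * l / (c * U ** 4)))

def samples_for_confidence(eps, delta, covering, U, c=128.0):
    """Smallest integer l with 4 * covering * exp(-eps^2 * l / (c * U^4)) <= delta.

    Obtained by solving the inequality for l:
        l >= c * U^4 * ln(4 * covering / delta) / eps^2
    """
    needed = c * U ** 4 * math.log(4.0 * covering / delta) / eps ** 2
    return max(1, math.ceil(needed))
```

The sample size grows only logarithmically with the covering term and with 1/delta, but quadratically with 1/eps, which is why a distribution-free bound on the covering term is the quantity worth controlling.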

We then introduce the concept of metric capacity of $A$, that is defined as

$$C(\epsilon, A, d_{L^1}) = \sup_P \{\mathcal{N}(\epsilon, A, d_{L^1(P)})\}$$

where the supremum is taken over all the probability distributions $P$ defined over $S$, and $d_{L^1(P)}$ is the standard $L^1(P)$ distance* induced by the probability distribution $P$:

$$d_{L^1(P)}(a_1, a_2) = \int_S d\xi \, P(\xi) |a_1(\xi) - a_2(\xi)| , \quad a_1, a_2 \in A .$$

The relationship between the covering number and the metric capacity is shown in the following

Claim D.2

$$E\left[\mathcal{N}(\epsilon, A_\xi, d_{L^1})\right] \le C(\epsilon, A, d_{L^1}) .$$

Proof: For any sequence of points $\xi$ in $S$, there is a trivial isometry between $(A_\xi, d_{L^1})$ and $(A, d_{L^1(P_\xi)})$ where $P_\xi$ is the empirical distribution on the space $S$ given by $\frac{1}{l} \sum_{i=1}^{l} \delta(\xi - \xi_i)$. Here $\delta$ is the Dirac delta function, $\xi \in S$, and $\xi_i$ is the $i$-th element of the data set. To see that this isometry exists, first note that for every element $a \in A$, there exists a unique point $(a(\xi_1), \ldots, a(\xi_l)) \in A_\xi$. Thus a simple bijective mapping exists between the two spaces. Now consider any two elements $g$ and $h$ of $A$. The distance between them is given by

$$d_{L^1(P_\xi)}(g, h) = \int_S |g(\xi) - h(\xi)| P_\xi(\xi) \, d\xi = \frac{1}{l} \sum_{i=1}^{l} |g(\xi_i) - h(\xi_i)| .$$

This is exactly what the distance between the two points $(g(\xi_1), \ldots, g(\xi_l))$ and $(h(\xi_1), \ldots, h(\xi_l))$, which are elements of $A_\xi$, is according to the $d_{L^1}$ distance. Thus there is a one-to-one correspondence between elements of $A$ and $A_\xi$, and the distance between two elements in $A$ is the same as the distance between their corresponding points in $A_\xi$. Given this isometry, for every $\epsilon$-cover in $A$, there exists an $\epsilon$-cover of the same size in $A_\xi$, so that

$$\mathcal{N}(\epsilon, A_\xi, d_{L^1}) = \mathcal{N}(\epsilon, A, d_{L^1(P_\xi)}) \le C(\epsilon, A, d_{L^1}) ,$$

and consequently $E[\mathcal{N}(\epsilon, A_\xi, d_{L^1})] \le C(\epsilon, A, d_{L^1})$. □

*Note that here $A$ is a class of real-valued functions defined on a general metric space $S$. If we consider an arbitrary $A$ defined on $S$ and taking values in $R^n$, the $d_{L^1(P)}$ norm is appropriately adjusted to be

$$d_{L^1(P)}(f, g) = \frac{1}{n} \sum_{i=1}^{n} \int_S |f_i(x) - g_i(x)| P(x) \, dx$$

where $f(x) = (f_1(x), \ldots, f_i(x), \ldots, f_n(x))$ and $g(x) = (g_1(x), \ldots, g_i(x), \ldots, g_n(x))$ are elements of $A$ and $P(x)$ is a probability distribution on $S$. Thus $d_{L^1}$ and $d_{L^1(P)}$ should be interpreted according to the context.

The result above, together with eq. (23), shows that the following proposition holds:

Claim D.3

$$P\left(\forall h \in H_n, \; |I_{emp}[h] - I[h]| \le \epsilon\right) \ge 1 - 4 \, C(\epsilon/16, A, d_{L^1}) \, e^{-\frac{1}{128 U^4} \epsilon^2 l} . \qquad (24)$$

Thus in order to obtain a uniform bound $\omega$ on $|I_{emp}[h] - I[h]|$, our task is reduced to computing the metric capacity of the functional class $A$ which we have just defined. We will do this in several steps. In Claim D.4, we first relate the metric capacity of $A$ to that of the class of radial basis functions $H_n$. Then Claims D.5 through D.9 go through a computation of the metric capacity of $H_n$.

Claim D.4

$$C(\epsilon, A, d_{L^1}) \le C(\epsilon/4U, H_n, d_{L^1})$$

Proof: Fix a distribution $P$ on $S = X \times Y$. Let $P_X$ be the marginal distribution with respect to $X$. Suppose $K$ is an $\epsilon/4U$-cover for $H_n$ with respect to this probability distribution $P_X$, i.e. with respect to the distance metric $d_{L^1(P_X)}$ on $H_n$. Further let the size of $K$ be $\mathcal{N}(\epsilon/4U, H_n, d_{L^1(P_X)})$. This means that for any $h \in H_n$, there exists a function $h^*$ belonging to $K$, such that:

$$\int |h(x) - h^*(x)| P_X(x) \, dx \le \epsilon/4U .$$

Now we claim the set $H(K) = \{(y - h(x))^2 : h \in K\}$ is an $\epsilon$-cover for $A$ with respect to the distance metric $d_{L^1(P)}$. To see this, it is sufficient to show that

$$\int |(y - h(x))^2 - (y - h^*(x))^2| P(x, y) \, dx \, dy \le \int 2 |2y - h - h^*| \, |h - h^*| \, P(x, y) \, dx \, dy \le \int 2 (2M + 2MV) |h - h^*| P(x, y) \, dx \, dy \le \epsilon ,$$

which is clearly true. Now

$$\mathcal{N}(\epsilon, A, d_{L^1(P)}) \le |H(K)| = \mathcal{N}(\epsilon/4U, H_n, d_{L^1(P_X)}) \le C(\epsilon/4U, H_n, d_{L^1}) .$$

Taking the supremum over all probability distributions, the result follows. □

So the problem reduces to finding $C(\epsilon, H_n, d_{L^1})$, i.e. the metric capacity of the class of appropriately defined Radial Basis Functions networks with $n$ centers. To do this we will decompose the class $H_n$ to be the composition of two classes defined as follows.

Definitions / Notations

$H_I$ is a class of functions defined from the metric space $(R^k, d_{L^1})$ to the metric space $(R^n, d_{L^1})$. In particular,

$$H_I = \{g(x) = (G(\|x - t_1\|), G(\|x - t_2\|), \ldots, G(\|x - t_n\|))\}$$

where $G$ is a Gaussian and the $t_i$ are $k$-dimensional vectors. Note here that $G$ is the same Gaussian that we have been using to build our Radial-Basis-Function Network. Thus $H_I$ is parametrized by the $n$ centers $t_i$ and the variance of the Gaussian $\sigma^2$, in other words $nk + 1$ parameters in all.

$H_F$ is a class defined from the metric space $([0, V]^n, d_{L^1})$ to the metric space $(R, d_{L^1})$. In particular,

$$H_F = \{h(x) = \beta \cdot x, \; x \in [0, V]^n \text{ and } \sum_{i=1}^{n} |\beta_i| \le M\}$$

where $\beta = (\beta_1, \ldots, \beta_n)$ is an arbitrary $n$-dimensional vector.

Thus we see that

$$H_n = \{h_F \circ h_I : h_F \in H_F \text{ and } h_I \in H_I\}$$

where $\circ$ stands for the composition operation, i.e., for any two functions $f$ and $g$, $f \circ g = f(g(x))$. It should be pointed out that $H_n$ as defined above is defined from $R^k$ to $R$.
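The composition can be written out directly. The following is a minimal sketch, assuming a Gaussian of the form V exp(-r^2 / (2 sigma^2)) so that its range lies in (0, V]; the parameter names `sigma` and `V` and the function names are illustrative, and the paper's normalization of the Gaussian may differ.

```python
import math

def gaussian(r, sigma=1.0, V=1.0):
    """Radial Gaussian of the distance r, with values in (0, V] (assumed form)."""
    return V * math.exp(-r * r / (2.0 * sigma * sigma))

def h_I(x, centers, sigma=1.0, V=1.0):
    """Element of H_I: maps x in R^k to (G(||x - t_1||), ..., G(||x - t_n||))."""
    return [gaussian(math.dist(x, t), sigma, V) for t in centers]

def h_F(z, beta, M):
    """Element of H_F: z -> beta . z, with the constraint sum_i |beta_i| <= M."""
    if sum(abs(b) for b in beta) > M + 1e-12:
        raise ValueError("coefficients violate sum |beta_i| <= M")
    return sum(b * zi for b, zi in zip(beta, z))

def rbf_network(x, centers, beta, M, sigma=1.0, V=1.0):
    """Element of H_n, written as the composition h_F o h_I."""
    return h_F(h_I(x, centers, sigma, V), beta, M)
```

Because every coordinate of `h_I(x, ...)` lies in [0, V] and the coefficients satisfy the constraint, the output of `rbf_network` is bounded in absolute value by M V, which is the bound used earlier to define U.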

Claim  D.5 

Proof:  Fix  a  probability  distribution  P  on  R!‘.  Consider 
the  class 

g  =  {y:y(x)  =  G(||x-t||),  t  G  R*}- 

Let  K  be  an  Af((,Q,dinp))-sized  e  cover  for  this  class. 
We  first  claim  that 

T  =  {{hi,..,hn)-.hieK} 

is  an  -'-cover  for  Hi  with  respect  to  the  d^^p)  metric. 

Remember  that  the  dii^^pf  distance  between  two 
vector-valued  functions  g(x)  =  (yi(x), ..,  y„(x))  and 
g*(x)  =  (yt(x),  ..,y;(x))  is  defined  as 

dL^p)[g,g*)  =  /  |yi(x) -y;(x)|R(x)dx 

To  see  this,  pick  an  arbitrary  g  =  (yi,...,yn)  £  Hi- 
For  each  gi,  there  exists  ay*  G  R  which  is  e-close 


in  the  appropriate  sense  for  real-valued  functions,  i.e. 
dmP){9i^9i)  £  The  function  g  =  i9l^  -y9n]  ‘s  an 
element  of T.  Also,  the  distance  between  (gi,..,3n)  and 
in  thed£i(/>)  metric  is 


Claim  D.6 


C(c,e,di.)  S  2 


( 


2eV 


In 


€ 


ik^2) 


f  • 

n 

i  =  l 

Thus  we  obtain  that 

A/(e,  Hi,dinp))  <  [A/'(c,5,dii{P))]" 
and  taking  the  supremum  over  all  probability  distribu¬ 
tions  as  usual,  we  get 

c{€,Hi,dL.)  <  (C(c.g,di.)r  ■ 

Now  we  need  to  find  the  capacity  of  Q.  This  is  done  in 
the  Claim  D.6.  From  this  the  result  follows.  □ 

Definitions  /  N  otations 


Proof:  Consider  the  k  -I-  2-dimensional  vector  space  of 
functions  from  iZ*  to  R  defined  as 

Gi  =  span{l,r*,r^,-,r*,|jx||^} 

where  x  €  and  is  the  /i-th  component  of  the  vector 
X.  Now  consider  the  class 

G2  =  {oe  ^  ■  f  &  Gi,  a  -  y—  } 

v27r<T 

where  cr  is  the  standard  deviation  of  the  Gaussian,  and 
f  £  Gi-  We  claim  that  the  pseudo-dimension  of  Q  de¬ 
noted  by  pdim(C?)  fulfills  the  following  inequality. 


Before we proceed to the next step in our proof, some more notation needs to be defined. Let $\mathcal{A}$ be a family of functions from a set $S$ into $R$. For any sequence $\xi = (\xi_1, \ldots, \xi_d)$ of points in $S$, let $\mathcal{A}_\xi$ be the restriction of $\mathcal{A}$ to the data set, as per our previously introduced notation. Thus $\mathcal{A}_\xi = \{(a(\xi_1), \ldots, a(\xi_d)) : a \in \mathcal{A}\}$. If there exists some translation of the set $\mathcal{A}_\xi$ such that it intersects all $2^d$ orthants of the space $R^d$, then $\xi$ is said to be shattered by $\mathcal{A}$. Expressing this a little more formally, let $B$ be the set of all possible $d$-dimensional boolean vectors. If there exists a translation $t \in R^d$ such that for every $b \in B$, there exists some function $a_b \in \mathcal{A}$ satisfying $a_b(\xi_i) - t_i \ge 0 \Leftrightarrow b_i = 1$ for all $i = 1$ to $d$, then the set $(\xi_1, \ldots, \xi_d)$ is shattered by $\mathcal{A}$. Note that the inequality could easily have been defined to be strict and would not have made a difference. The largest $d$ such that there exists a sequence of $d$ points which are shattered by $\mathcal{A}$ is said to be the pseudo-dimension of $\mathcal{A}$, denoted by $\mathrm{pdim}\,\mathcal{A}$. □
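For a finite family, the shattering condition in this definition can be checked by brute force. The following is a minimal sketch rather than anything taken from the text: the helper names `sign_patterns` and `is_shattered`, the two small families, the points and the translation are all illustrative assumptions.

```python
def sign_patterns(family, points, t):
    """Boolean vectors b realised by the family, where b_i = 1 exactly
    when a(points[i]) - t[i] >= 0."""
    return {
        tuple(int(a(x) - ti >= 0) for x, ti in zip(points, t))
        for a in family
    }


def is_shattered(family, points, t):
    """True if, under the translation t, the family realises all 2^d patterns."""
    return len(sign_patterns(family, points, t)) == 2 ** len(points)


# Hypothetical finite families on S = R, chosen only for illustration.
affine = [lambda x, w=w, c=c: w * x + c for w in (-1, 0, 1) for c in (0, 1)]
constants = [lambda x, c=c: c for c in (0, 1)]

points = [0.0, 1.0]   # the sequence (xi_1, xi_2)
t = (0.5, 0.5)        # an assumed translation
```

With these choices the affine family realises all four boolean vectors on the two points, while the constant family realises only two of them, in line with a two-dimensional versus a one-dimensional vector space of functions.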

In this context, there are two important theorems which we will need to use. We give these theorems without proof.

Theorem D.1 (Dudley) Let $F$ be a $k$-dimensional vector space of functions from a set $S$ into $R$. Then $\mathrm{pdim}(F) = k$.

The following theorem is stated and proved in a somewhat more general form by Pollard. Haussler, using techniques from Pollard, has proved the specific form shown here.

Theorem D.2 (Pollard, Haussler) Let $F$ be a family of functions from a set $S$ into $[M_1, M_2]$, where $\mathrm{pdim}(F) = d$ for some $1 \le d < \infty$. Let $P$ be a probability distribution on $S$. Then for all $0 < \epsilon \le M_2 - M_1$,

$$\mathcal{M}(\epsilon, F, d_{L^1(P)}) \le 2\left(\frac{2e(M_2 - M_1)}{\epsilon}\log\frac{2e(M_2 - M_1)}{\epsilon}\right)^d$$

Here $\mathcal{M}(\epsilon, F, d_{L^1(P)})$ is the packing number of $F$ according to the distance metric $d_{L^1(P)}$.
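To get a feel for how this packing bound scales with the accuracy and the pseudo-dimension, it can be evaluated directly. The sketch below assumes the natural logarithm and the form $2\,(r \log r)^d$ with $r = 2e(M_2 - M_1)/\epsilon$; the function name `packing_bound` is hypothetical.

```python
import math


def packing_bound(eps, m1, m2, d):
    """Evaluate 2 * (r * log r) ** d with r = 2e(M2 - M1)/eps (natural log assumed)."""
    if not (0 < eps <= m2 - m1):
        raise ValueError("need 0 < eps <= M2 - M1")
    r = 2 * math.e * (m2 - m1) / eps
    return 2 * (r * math.log(r)) ** d
```

For a fixed range and accuracy the bound grows geometrically in $d$, which is why the pseudo-dimension is the quantity to control in the claims that follow.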


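$$\mathrm{pdim}(\mathcal{G}) \le \mathrm{pdim}(G_2) = \mathrm{pdim}(G_1) = (k + 2).$$

To see this consider the fact that $\mathcal{G} \subset G_2$. Consequently, for every sequence of points $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_d)$, $\mathcal{G}_{\mathbf{x}} \subset (G_2)_{\mathbf{x}}$. Thus if $(\mathbf{x}_1, \ldots, \mathbf{x}_d)$ is shattered by $\mathcal{G}$, it will be shattered by $G_2$. This establishes the first inequality.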

We now show that $\mathrm{pdim}(G_2) \le \mathrm{pdim}(G_1)$. It is enough to show that every set shattered by $G_2$ is also shattered by $G_1$. Suppose there exists a sequence $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_d)$ which is shattered by $G_2$. This means that by our definition of shattering, there exists a translation $t \in R^d$ such that for every boolean vector $b \in \{0,1\}^d$ there is some function $g_b = \alpha e^{-f_b}$ where $f_b \in G_1$ satisfying

$$g_b(\mathbf{x}_i) \ge t_i \Leftrightarrow b_i = 1,$$

where $t_i$ and $b_i$ are the $i$-th components of $t$ and $b$ respectively. First notice that every function in $G_2$ is positive. Consequently, we see that every $t_i$ has to be greater than 0, for otherwise $g_b(\mathbf{x}_i)$ could never be less than $t_i$, which it is required to be if $b_i = 0$. Having established that every $t_i$ is greater than 0, we now show that the set $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_d)$ is shattered by $G_1$. We let the translation in this case be $t' = (\log(t_1/\alpha), \log(t_2/\alpha), \ldots, \log(t_d/\alpha))$. We can take the log since the $t_i/\alpha$'s are greater than 0. Now for every boolean vector $b$, we take the function $-f_b \in G_1$ and we see that since

$$g_b(\mathbf{x}_i) \ge t_i \Leftrightarrow b_i = 1,$$

it follows that

$$-f_b(\mathbf{x}_i) \ge \log(t_i/\alpha) = t'_i \Leftrightarrow b_i = 1.$$

Thus we see that the set $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_d)$ can be shattered by $G_1$. By a similar argument, it is also possible to show that $\mathrm{pdim}(G_1) \le \mathrm{pdim}(G_2)$, so that the two pseudo-dimensions are equal.
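The key step in this argument is that thresholding $\alpha e^{-f}$ at $t_i > 0$ gives the same labels as thresholding $-f$ at $\log(t_i/\alpha)$, because the logarithm is monotone. A small numerical sketch of that equivalence follows; the value of `alpha`, the sampling ranges and the helper name `same_labels` are assumptions made only for illustration.

```python
import math
import random


def same_labels(f_values, t, alpha):
    """Compare the labels induced by g = alpha * exp(-f) against t with those
    induced by -f against t' = log(t / alpha), coordinate by coordinate."""
    labels_g = [alpha * math.exp(-f) >= ti for f, ti in zip(f_values, t)]
    labels_f = [-f >= math.log(ti / alpha) for f, ti in zip(f_values, t)]
    return labels_g == labels_f


rng = random.Random(0)
alpha = 0.4  # assumed positive constant
trials = [
    ([rng.uniform(-3, 3) for _ in range(5)],     # values f_b(x_i)
     [rng.uniform(0.01, 2) for _ in range(5)])   # positive thresholds t_i
    for _ in range(1000)
]
all_agree = all(same_labels(f, t, alpha) for f, t in trials)
```

The thresholds are sampled strictly positive, matching the observation in the proof that every $t_i$ must exceed 0 before the logarithm can be taken.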

Since $G_1$ is a vector space of dimensionality $k + 2$, an application of Dudley's Theorem [24] yields the value $k + 2$ for its pseudo-dimension. Further, functions in the class $\mathcal{G}$ are in the range $[0, V]$. Now we see (by an application of Pollard's theorem) that


Claim  D.9 


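$$\mathcal{M}(\epsilon, \mathcal{G}, d_{L^1(P)}) \le 2\left(\frac{2eV}{\epsilon}\ln\frac{2eV}{\epsilon}\right)^{\mathrm{pdim}(\mathcal{G})} \le 2\left(\frac{2eV}{\epsilon}\ln\frac{2eV}{\epsilon}\right)^{k+2}$$

Taking the supremum over all probability distributions, the result follows. □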

Claim D.7

$$\mathcal{C}(\epsilon, H_F, d_{L^1}) \le 2\left(\frac{4MeV}{\epsilon}\ln\left(\frac{4MeV}{\epsilon}\right)\right)^n$$


v 


IMn  1 

Proof: Fix a distribution $P$ on $R^k$. Assume we have an $\epsilon/(2Mn)$-cover for $H_I$ with respect to the probability distribution $P$ and metric $d_{L^1(P)}$. Let it be $K$ where

$$|K| = \mathcal{N}(\epsilon/(2Mn), H_I, d_{L^1(P)}).$$

Now each function $f \in K$ maps the space $R^k$ into $R^n$, thus inducing a probability distribution $P_f$ on the space $R^n$. Specifically, $P_f$ can be defined as the distribution obtained from the measure $\mu_f$ defined so that any measurable set $A \subset R^n$ will have measure


Proof: The proof of this runs in very similar fashion. First note that

$$H_F \subset \{\beta \cdot \mathbf{x} : \mathbf{x}, \beta \in R^n\}.$$

The latter set is a vector space of dimensionality $n$ and by Dudley's theorem [24], we see that its pseudo-dimension pdim is $n$. Also, clearly by the same argument as in the previous proposition, we have that $\mathrm{pdim}(H_F) \le n$. To get bounds on the functions in $H_F$, notice that


$$\mu_f(A) = \int_{f^{-1}(A)} P(\mathbf{x})\,d\mathbf{x} .$$

Further, there exists a cover $K_f$ which is an $\epsilon/2$-cover for $H_F$ with respect to the probability distribution $P_f$. In other words

$$|K_f| = \mathcal{N}(\epsilon/2, H_F, d_{L^1(P_f)}).$$

We claim that


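$$\left|\sum_{i=1}^{n} \beta_i x_i\right| \le \sum_{i=1}^{n} |\beta_i|\,|x_i| \le V \sum_{i=1}^{n} |\beta_i| \le MV .$$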

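Thus functions in $H_F$ are bounded in the range $[-MV, MV]$. Now using Pollard's result [42], [69], we have that

$$\mathcal{N}(\epsilon, H_F, d_{L^1(P)}) \le \mathcal{M}(\epsilon, H_F, d_{L^1(P)}) \le 2\left(\frac{4MeV}{\epsilon}\ln\left(\frac{4MeV}{\epsilon}\right)\right)^n .$$

Taking supremums over all probability distributions, the result follows. □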

Claim D.8 A uniform first-order Lipschitz bound of $H_F$ is $Mn$.

Proof: Suppose we have $\mathbf{x}, \mathbf{y} \in R^n$ such that

$$d_{L^1}(\mathbf{x}, \mathbf{y}) \le \epsilon .$$

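The quantity $Mn$ is a uniform first-order Lipschitz bound for $H_F$ if, for any element of $H_F$, parametrized by a vector $\beta$, the following inequality holds:

$$|\mathbf{x} \cdot \beta - \mathbf{y} \cdot \beta| \le Mn\epsilon$$

Now clearly,

$$|\mathbf{x} \cdot \beta - \mathbf{y} \cdot \beta| = \left|\sum_{i=1}^{n} \beta_i (x_i - y_i)\right| \le \sum_{i=1}^{n} |\beta_i|\,|x_i - y_i| \le M \sum_{i=1}^{n} |x_i - y_i| \le Mn\epsilon .$$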

The  result  is  proved.  □ 
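The middle steps of this chain, $|\mathbf{x}\cdot\beta - \mathbf{y}\cdot\beta| \le M \sum_i |x_i - y_i|$ whenever every $|\beta_i| \le M$, can be spot-checked numerically. The sketch below is illustrative only: the helper name `lipschitz_gap`, the sampling ranges and the values of `M` and `n` are assumptions.

```python
import random


def lipschitz_gap(x, y, beta, M):
    """Return (lhs, rhs) with lhs = |x.beta - y.beta| and
    rhs = M * sum_i |x_i - y_i|."""
    lhs = abs(sum(b * (xi - yi) for b, xi, yi in zip(beta, x, y)))
    rhs = M * sum(abs(xi - yi) for xi, yi in zip(x, y))
    return lhs, rhs


rng = random.Random(1)
M, n = 2.0, 6  # assumed coefficient bound and number of basis functions
checks = []
for _ in range(1000):
    beta = [rng.uniform(-M, M) for _ in range(n)]  # every |beta_i| <= M
    x = [rng.uniform(0, 1) for _ in range(n)]
    y = [rng.uniform(0, 1) for _ in range(n)]
    lhs, rhs = lipschitz_gap(x, y, beta, M)
    checks.append(lhs <= rhs + 1e-12)  # small slack for floating-point rounding
```

Equality is attained when all of the difference sits in one coordinate whose coefficient has magnitude exactly $M$, so the constant cannot be improved without further assumptions on $\beta$.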


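$$H(K) = \{f \circ g : g \in K \text{ and } f \in K_g\}$$

is an $\epsilon$-cover for $H_n$. Further we note that

$$|H(K)| = \sum_{f \in K} |K_f| \le \sum_{f \in K} \mathcal{C}(\epsilon/2, H_F, d_{L^1}) \le \mathcal{N}(\epsilon/(2Mn), H_I, d_{L^1(P)})\,\mathcal{C}(\epsilon/2, H_F, d_{L^1})$$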

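To see that $H(K)$ is an $\epsilon$-cover, suppose we are given an arbitrary function $h_f \circ h_i \in H_n$. There clearly exists a function $h_i^* \in K$ such that

$$\int_{R^k} d_{L^1}(h_i(\mathbf{x}), h_i^*(\mathbf{x}))\,P(\mathbf{x})\,d\mathbf{x} \le \epsilon/(2Mn)$$

Now there also exists a function $h_f^* \in K_{h_i^*}$ such that

$$\int_{R^k} |h_f \circ h_i^*(\mathbf{x}) - h_f^* \circ h_i^*(\mathbf{x})|\,P(\mathbf{x})\,d\mathbf{x} = \int_{R^n} |h_f(\mathbf{y}) - h_f^*(\mathbf{y})|\,P_{h_i^*}(\mathbf{y})\,d\mathbf{y} \le \epsilon/2 .$$

To show that $H(K)$ is an $\epsilon$-cover it is sufficient to show that

$$\int_{R^k} |h_f \circ h_i(\mathbf{x}) - h_f^* \circ h_i^*(\mathbf{x})|\,P(\mathbf{x})\,d\mathbf{x} \le \epsilon .$$

Now

$$\int_{R^k} |h_f \circ h_i(\mathbf{x}) - h_f^* \circ h_i^*(\mathbf{x})|\,P(\mathbf{x})\,d\mathbf{x} \le \int_{R^k} \left\{ |h_f \circ h_i(\mathbf{x}) - h_f \circ h_i^*(\mathbf{x})| + |h_f \circ h_i^*(\mathbf{x}) - h_f^* \circ h_i^*(\mathbf{x})| \right\} P(\mathbf{x})\,d\mathbf{x}$$

by the triangle inequality. Further, since $h_f$ is Lipschitz bounded,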



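$$\int_{R^k} |h_f \circ h_i(\mathbf{x}) - h_f \circ h_i^*(\mathbf{x})|\,P(\mathbf{x})\,d\mathbf{x} \le \int_{R^k} Mn\, d_{L^1}(h_i(\mathbf{x}), h_i^*(\mathbf{x}))\,P(\mathbf{x})\,d\mathbf{x} \le Mn\left(\frac{\epsilon}{2Mn}\right) \le \frac{\epsilon}{2} .$$

Also,

$$\int_{R^k} |h_f \circ h_i^*(\mathbf{x}) - h_f^* \circ h_i^*(\mathbf{x})|\,P(\mathbf{x})\,d\mathbf{x} = \int_{R^n} |h_f(\mathbf{y}) - h_f^*(\mathbf{y})|\,P_{h_i^*}(\mathbf{y})\,d\mathbf{y} \le \frac{\epsilon}{2} .$$

Consequently both sums are less than $\epsilon/2$ and the total integral is less than $\epsilon$. Now we see that

$$\mathcal{N}(\epsilon, H_n, d_{L^1(P)}) \le \mathcal{N}(\epsilon/(2Mn), H_I, d_{L^1(P)})\,\mathcal{C}(\epsilon/2, H_F, d_{L^1}) .$$

Taking supremums over all probability distributions, the result follows. □

Having obtained the crucial bound on the metric capacity of the class $H_n$, we can now prove the following

Claim D.10 With probability $1 - \delta$, and $\forall h \in H_n$, the following bound holds:

$$|I_{emp}[h] - I[h]| \le O\left(\left[\frac{nk\ln(nl) + \ln(1/\delta)}{l}\right]^{1/2}\right)$$

Proof: We know from the previous claim that

$$\mathcal{C}(\epsilon, H_n, d_{L^1}) \le 2^{n+1}\left[\frac{4MeVn}{\epsilon}\ln\left(\frac{4MeVn}{\epsilon}\right)\right]^{n(k+2)}\left[\frac{8MeV}{\epsilon}\ln\left(\frac{8MeV}{\epsilon}\right)\right]^{n} \le \left[\frac{8MeVn}{\epsilon}\ln\left(\frac{8MeVn}{\epsilon}\right)\right]^{n(k+3)} .$$

From claim (D.3), we see that

$$P(\forall h \in H_n,\ |I_{emp}[h] - I[h]| \le \epsilon) \ge 1 - \delta$$

as long as

$$\mathcal{C}(\epsilon/16, \mathcal{A}, d_{L^1})\, e^{-\epsilon^2 l/B} \le \frac{\delta}{4}$$

where $B$ denotes the constant in the exponent of claim (D.3), which in turn is satisfied as long as (by Claim D.4)

$$\mathcal{C}(\epsilon/(64U), H_n, d_{L^1})\, e^{-\epsilon^2 l/B} \le \frac{\delta}{4}$$

which implies

$$\left(\frac{256MeVUn}{\epsilon}\ln\left(\frac{256MeVUn}{\epsilon}\right)\right)^{n(k+3)} e^{-\epsilon^2 l/B} \le \frac{\delta}{4} \qquad (25)$$

In other words,

$$\left(\frac{An}{\epsilon}\ln\left(\frac{An}{\epsilon}\right)\right)^{n(k+3)} e^{-\epsilon^2 l/B} \le \frac{\delta}{4}$$

for constants $A$, $B$. The latter inequality is satisfied as long as

$$\left(\frac{An}{\epsilon}\right)^{2n(k+3)} e^{-\epsilon^2 l/B} \le \frac{\delta}{4}$$

which implies

$$2n(k+3)(\ln(An) - \ln(\epsilon)) - \epsilon^2 l/B \le \ln(\delta/4)$$

and in turn implies

$$\epsilon^2 l \ge B\ln(4/\delta) + 2Bn(k+3)(\ln(An) - \ln(\epsilon)).$$

We now show that the above inequality is satisfied for

$$\epsilon = \left(\frac{B\left[\ln(4/\delta) + 2n(k+3)\ln(An) + n(k+3)\ln(l)\right]}{l}\right)^{1/2} .$$

Putting the above value of $\epsilon$ in the inequality of interest, we get

$$\epsilon^2 (l/B) = \ln(4/\delta) + 2n(k+3)\ln(An) + n(k+3)\ln(l) \ge$$

$$\ge \ln(4/\delta) + 2n(k+3)\ln(An) + 2n(k+3)\,\frac{1}{2}\ln\left(\frac{l}{B\left[\ln(4/\delta) + 2n(k+3)\ln(An) + n(k+3)\ln(l)\right]}\right)$$

In other words,

$$n(k+3)\ln(l) \ge n(k+3)\ln\left(\frac{l}{B\left[\ln(4/\delta) + 2n(k+3)\ln(An) + n(k+3)\ln(l)\right]}\right).$$

Since

$$B\left[\ln(4/\delta) + 2n(k+3)\ln(An) + n(k+3)\ln(l)\right] \ge 1$$

the inequality is obviously true for this value of $\epsilon$. Taking this value of $\epsilon$ then proves our claim. □

D.3  Bounding the generalization error

Finally we are able to take our results in Parts II and III to prove our main result:

Theorem D.3 With probability greater than $1 - \delta$ the following inequality is valid:

$$\|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} \le O\left(\frac{1}{n}\right) + O\left(\left[\frac{nk\ln(nl) - \ln\delta}{l}\right]^{1/2}\right)$$

Proof: We have seen in statement (2.1) that the generalization error is bounded as follows:

$$\|f_0 - \hat{f}_{n,l}\|^2_{L^2(P)} \le \varepsilon(n) + 2\omega(l, n, \delta) .$$

In section (D.1) we showed that

$$\varepsilon(n) = O\left(\frac{1}{n}\right)$$

and in claim (D.10) we showed that

$$\omega(l, n, \delta) = O\left(\left[\frac{nk\ln(nl) - \ln\delta}{l}\right]^{1/2}\right) .$$

Therefore the theorem is proved putting these results together. □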
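The choice of $\epsilon$ made in the proof of Claim D.10, which drives the second term of the bound, can be checked numerically against the requirement $\epsilon^2 l \ge B\ln(4/\delta) + 2Bn(k+3)(\ln(An) - \ln\epsilon)$. The following is a minimal sketch: the function names and the numerical values chosen for the constants $A$, $B$ and for $n$, $k$, $l$, $\delta$ are hypothetical and serve only as an illustration.

```python
import math


def epsilon_choice(A, B, n, k, l, delta):
    """epsilon = sqrt(B * [ln(4/delta) + 2n(k+3) ln(An) + n(k+3) ln(l)] / l)."""
    bracket = (math.log(4 / delta)
               + 2 * n * (k + 3) * math.log(A * n)
               + n * (k + 3) * math.log(l))
    return math.sqrt(B * bracket / l)


def satisfies_requirement(eps, A, B, n, k, l, delta):
    """Check eps^2 * l >= B ln(4/delta) + 2 B n (k+3) (ln(An) - ln(eps))."""
    rhs = (B * math.log(4 / delta)
           + 2 * B * n * (k + 3) * (math.log(A * n) - math.log(eps)))
    return eps ** 2 * l >= rhs


# Hypothetical constants and problem sizes.
A, B, n, k, delta = 10.0, 5.0, 3, 2, 0.05
eps_small_sample = epsilon_choice(A, B, n, k, 1_000, delta)
eps_large_sample = epsilon_choice(A, B, n, k, 100_000, delta)
```

For these values the requirement holds at both sample sizes, and the resulting $\epsilon$ shrinks as $l$ grows, consistent with the $[\,\cdot\,/l]^{1/2}$ rate in the estimation term.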
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